is there a way to persist the lineages generated by spark?

2017-04-03 Thread kant kodali
Hi All, I am wondering if there is a way to persist the lineages generated by Spark underneath? Some of our clients want us to prove that the results of the computations we show on a dashboard are correct, and for that, if we can show the lineage of transformations that are executed to get to th
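Spark does expose lineage as text that can be persisted: `rdd.toDebugString` prints an RDD's lineage, and `df.queryExecution.toString` shows a DataFrame's logical and physical plans; either string can be written to a store alongside the dashboard result for auditing. As a conceptual illustration only, here is a minimal pure-Scala sketch of recording a lineage next to a result (`Traced` is a made-up helper, not a Spark API):

```scala
// Sketch: wrap a dataset so every transformation appends a
// human-readable step to a lineage log that can be persisted
// and later shown to auditors next to the final result.
final case class Traced[A](value: A, lineage: List[String]) {
  def step[B](description: String)(f: A => B): Traced[B] =
    Traced(f(value), lineage :+ description)
}

val result = Traced(Seq(1, 2, 3, 4), Nil)
  .step("filter: keep even values")(_.filter(_ % 2 == 0))
  .step("map: double each value")(_.map(_ * 2))

// result.value   == Seq(4, 8)
// result.lineage records both steps in order
```

In real Spark the equivalent record would come from persisting `toDebugString`/`queryExecution` output at the point each dashboard figure is computed.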

Re: Scala left join with multiple columns: Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

2017-04-03 Thread Andrew Ray
You probably don't want null safe equals (<=>) with a left join. On Mon, Apr 3, 2017 at 5:46 PM gjohnson35 wrote: > The join condition with && is throwing an exception: > > val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=> > baseDF("medical_claim_id") > && mccDF("medical_claim_det

Scala left join with multiple columns: Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

2017-04-03 Thread gjohnson35
The join condition with && is throwing an exception: val df = baseDF.join(mccDF, mccDF("medical_claim_id") <=> baseDF("medical_claim_id") && mccDF("medical_claim_detail_id") <=> baseDF("medical_claim_detail_id"), "left") .join(revCdDF, revCdDF("revenue_code_padded_str") <=> mccDF("
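The root of Andrew's advice is the semantics of `<=>` (EqualNullSafe) versus `===` (EqualTo) on NULL keys: `===` yields NULL when either side is NULL, so the row simply fails the join condition, while `<=>` treats NULL as equal to NULL, which is rarely what you want as a left-join key match. A plain-Scala sketch of the two behaviors, with `None` standing in for SQL NULL (illustrative only, not Spark code):

```scala
// ===  (EqualTo):       NULL = anything   -> NULL (row does not match)
// <=>  (EqualNullSafe): NULL <=> NULL     -> true (rows match on NULL keys)
def equalTo(a: Option[Int], b: Option[Int]): Option[Boolean] =
  for { x <- a; y <- b } yield x == y

def equalNullSafe(a: Option[Int], b: Option[Int]): Boolean =
  (a, b) match {
    case (None, None)       => true
    case (Some(x), Some(y)) => x == y
    case _                  => false
  }

assert(equalTo(None, None).isEmpty)   // NULL = NULL is NULL: no match
assert(equalNullSafe(None, None))     // NULL <=> NULL matches both rows
assert(equalNullSafe(Some(1), None) == false)
```

With `<=>` in a left join, every left row with a NULL key matches every right row with a NULL key, which can blow up row counts; switching the condition to `===` usually gives the intended semantics.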

Re: Mesos checkpointing

2017-04-03 Thread Timothy Chen
Yes, adding the timeout config should be the only code change required. And just to clarify, this is for reconnecting with Mesos master (not agents) after failover. Tim On Mon, Apr 3, 2017 at 2:23 PM, Charles Allen wrote: > We had investigated internally recently why restarting the mesos agents

Re: Mesos checkpointing

2017-04-03 Thread Charles Allen
We had investigated internally recently why restarting the mesos agents failed the spark jobs (no real reason they should, right?) and came across the data. The other conversation by Yu sparked trying to poke to get some of the tickets updated to spread around any tribal knowledge that is floating

Re: Mesos checkpointing

2017-04-03 Thread Timothy Chen
The only reason is that MesosClusterScheduler by design is long running so we really needed it to have failover configured correctly. I wanted to create a JIRA ticket to allow users to configure it for each Spark framework, but just didn't remember to do so. Per another question that came up in t

Mesos checkpointing

2017-04-03 Thread Charles Allen
As per https://issues.apache.org/jira/browse/SPARK-4899 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver allows checkpointing, but only org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler uses it. Is there a reason for that?
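For context, at the Mesos level checkpointing is a boolean on the framework registration (`FrameworkInfo.checkpoint`), and the "timeout config" Tim mentions is the framework failover timeout on the same message. A hedged sketch of how this could surface as Spark properties; the checkpoint key below is hypothetical (SPARK-4899 was still open at the time of this thread), while `spark.mesos.driver.failoverTimeout` appears in later Spark releases — verify both against your version's Mesos deployment docs:

```properties
# Hypothetical: SPARK-4899 proposes exposing Mesos checkpointing per
# Spark framework; this key did not exist at the time of this thread.
spark.mesos.checkpoint              true

# In later Spark releases: seconds the Mesos master waits for the driver
# to reconnect after a disconnect before killing its tasks (default 0.0,
# i.e. fail immediately).
spark.mesos.driver.failoverTimeout  60.0
```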

Re: Fwd: [SparkSQL] Project using NamedExpression

2017-04-03 Thread Aviral Agarwal
Hi, I made some progress in binding the expressions to a LogicalPlan and then analyzing the plan. The problem is the unique IDs that are assigned to every expression. def apply(dataFrame: DataFrame, selectExpressions: java.util.List[String]): RDD[InternalRow] = { val schema = dataFrame.schema val
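The IDs in question are Catalyst's `ExprId`s: the analyzer assigns every `NamedExpression` a fresh one, so attributes resolved in one analysis pass will not compare equal to the "same" attributes in a plan analyzed separately. A minimal pure-Scala sketch of that mismatch (the classes below are illustrative stand-ins, not Spark's internals):

```scala
import java.util.concurrent.atomic.AtomicLong

// Conceptual stand-in for Catalyst's ExprId allocation: every newly
// created attribute gets a fresh id, so two independent resolutions
// of the same column name never match by id.
object IdGenerator {
  private val counter = new AtomicLong(0)
  def next(): Long = counter.getAndIncrement()
}

final case class AttributeRef(name: String, exprId: Long)

def resolve(name: String): AttributeRef =
  AttributeRef(name, IdGenerator.next())

val first  = resolve("medical_claim_id")
val second = resolve("medical_claim_id")

assert(first.name == second.name)     // same column name...
assert(first.exprId != second.exprId) // ...but the ids never line up
```

This is why binding expressions against an already-analyzed plan usually means reusing the plan's own attribute references (e.g. going through the existing `DataFrame`'s resolved output) rather than re-resolving names from scratch.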