Re: Contributed to spark
Links that were helpful to me while learning about the Spark source code:

- Articles tagged "spark" on this blog: http://hydronitrogen.com/tag/spark.html
- Jacek's "Mastering Apache Spark" GitBook: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/

Hope those can help.

On Sat, Apr 8, 2017 at 1:31 AM, Stephen Fletcher wrote:
> I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
> the query planner is heavily used throughout the Dataset code base. Are there
> any sites I can go to that explain the technical details, beyond just a
> high-level perspective?
Re: Structured streaming and writing output to Cassandra
Thanks Jules. It was helpful.

On Fri, Apr 7, 2017 at 8:32 PM, Jules Damji wrote:
> This blog shows how to write a custom sink:
> https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Apr 7, 2017, at 11:23 AM, shyla deshpande wrote:
>
> Is anyone using Structured Streaming and writing the results to a Cassandra
> database in a production environment?
>
> I do not think I have enough expertise to write a custom sink that can be
> used in a production environment. Please help!
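For reference, a minimal sketch of the custom-sink approach the blog describes, using Structured Streaming's `ForeachWriter` with the DataStax Java driver. This is an illustration only: the keyspace, table, and column names are made up, and production code would need connection pooling, prepared statements, and error handling.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import com.datastax.driver.core.{Cluster, Session}

// A bare-bones sink: one Cassandra session per partition, one INSERT per row.
// Assumes a hypothetical table ks.events(id text PRIMARY KEY, value text).
class CassandraSinkWriter(host: String) extends ForeachWriter[Row] {
  var cluster: Cluster = _
  var session: Session = _

  override def open(partitionId: Long, version: Long): Boolean = {
    cluster = Cluster.builder().addContactPoint(host).build()
    session = cluster.connect()
    true // true means "go ahead and process this partition"
  }

  override def process(row: Row): Unit = {
    session.execute(
      "INSERT INTO ks.events (id, value) VALUES (?, ?)",
      row.getString(0), row.getString(1))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}

// Usage (df is a streaming DataFrame with two string columns):
// df.writeStream.foreach(new CassandraSinkWriter("127.0.0.1")).start()
```

Note that `open`/`process`/`close` run on the executors, so the writer must be serializable and must create its connections inside `open`, not in the constructor.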
Re: Why dataframe can be more efficient than dataset?
how would you use only relational transformations on a Dataset?

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan wrote:
> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone give some explanation why? Any benchmark comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan
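To make the question concrete, here is a sketch of what "only relational transformations on a Dataset" could mean: using Column expressions (which Catalyst can inspect) on a typed Dataset, as opposed to opaque lambdas. The case class and sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val ds = Seq(Person("alice", 30), Person("bob", 17)).toDS()

// Relational style: a Column expression, fully visible to Catalyst.
val adults = ds.filter(col("age") >= 18)

// Functional style: a Scala lambda, a black box to Catalyst.
val adultsLambda = ds.filter(p => p.age >= 18)

// Comparing the two physical plans shows the difference:
adults.explain()
adultsLambda.explain()
```

Both return the same rows; the question in this thread is whether the first form on a Dataset is as fast as the same expression on a DataFrame.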
Re: Why dataframe can be more efficient than dataset?
let me try that again. i left some crap at the bottom of my previous email as i was editing it. sorry about that. here it goes:

it is because you use Dataset[X] but the actual computations are still done in Dataset[Row] (so DataFrame). well... the actual computations are done in RDD[InternalRow], with Spark's internal types to represent String, Map, Seq, structs, etc.

so for example if you do:

scala> val x: Dataset[(String, String)] = ...
scala> val f: (String, String) => Boolean = _._2 != null
scala> x.filter(f)

in this case you are using a lambda function for the filter. this is a black-box operation to Spark (Spark cannot see what is inside the function). so Spark will now convert the internal representation it is actually using (something like an InternalRow of size 2 with two objects of type UTF8String inside it) into a Tuple2[String, String], and then call your function f on it. so for this very simple null comparison you are doing a relatively expensive conversion.

now compare this to a DataFrame that holds 2 columns of type String:

scala> val x: DataFrame = ...
x: org.apache.spark.sql.DataFrame = [x: string, y: string]
scala> x.filter($"y" isNotNull)

Spark will parse your expression, and since it has an understanding of what you are trying to do, it can apply the logic directly on the InternalRow, which avoids the conversion. this will be faster. of course you pay the price for this in that you are forced to use a much more constrained framework to express what you want to do, which can lead to some hair pulling at times.

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan wrote:
> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone give some explanation why? Any benchmark comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan
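The conversion cost described above can be observed in the physical plans. A self-contained sketch (local mode, made-up two-column data) comparing the lambda filter with the Column filter:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("plans").getOrCreate()
import spark.implicits._

// A Dataset[(String, String)] with one null in the second column.
val x = Seq(("a", "b"), ("c", null)).toDS()

// Lambda filter: the plan typically contains a generic filter over a
// deserialized object, i.e. Spark builds a Tuple2 just to call the function.
x.filter((t: (String, String)) => t._2 != null).explain()

// Column filter: the plan typically shows "Filter isnotnull(_2)" applied
// directly on the internal row representation, with no deserialization.
x.filter($"_2".isNotNull).explain()
```

Exact plan text varies by Spark version, but the deserialization step around the lambda is the conversion the email is describing.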
Re: Why dataframe can be more efficient than dataset?
As far as I am aware, in newer Spark versions a DataFrame is the same as Dataset[Row]. In fact, performance depends on so many factors that I am not sure such a comparison makes sense.

> On 8. Apr 2017, at 20:15, Shiyuan wrote:
>
> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone give some explanation why? Any benchmark comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan
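Indeed, since Spark 2.0 `DataFrame` is literally a type alias for `Dataset[Row]` in the `org.apache.spark.sql` package, so converting between the two is a change of view, not a copy of the data. A small sketch (case class name is made up):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("alias").getOrCreate()
import spark.implicits._

case class Pair(x: String, y: String)

// A DataFrame is a Dataset[Row]: untyped rows, schema known at runtime.
val df: DataFrame = Seq(("a", "b")).toDF("x", "y")

// .as[Pair] attaches an encoder; the underlying logical plan is unchanged.
val ds: Dataset[Pair] = df.as[Pair]

// .toDF() goes back to untyped rows, again without touching the data.
val back: DataFrame = ds.toDF()
```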
Re: Assigning a unique row ID
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram wrote:
> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.

Ah, okay, awesome. Let me give that a go.

> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Apr 7, 2017, at 7:32 PM, Everett Anderson wrote:
>
> Hi,
>
> Thanks, but that's using a random UUID. Certainly unlikely to have
> collisions, but not guaranteed.
>
> I'd rather prefer something like monotonically_increasing_id or RDD's
> zipWithUniqueId but with better behavioral characteristics -- so they don't
> surprise people when 2+ outputs derived from an original table end up not
> having the same IDs for the same rows, anymore.
>
> It seems like this would be possible under the covers, but would have the
> performance penalty of needing to do perhaps a count() and then also a
> checkpoint.
>
> I was hoping there's a better way.
>
> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith wrote:
>
>> http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>
>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson wrote:
>>
>>> Hi,
>>>
>>> What's the best way to assign a truly unique row ID (rather than a hash)
>>> to a DataFrame/Dataset?
>>>
>>> I originally thought that functions.monotonically_increasing_id would
>>> do this, but it seems to have a rather unfortunate property that if you add
>>> it as a column to table A and then derive tables X, Y, Z and save those,
>>> the row ID values in X, Y, and Z may end up different. I assume this is
>>> because it delays the actual computation to the point where each of those
>>> tables is computed.
>>
>> --
>> Thanks,
>> Tim
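A sketch of the caching approach Subhash describes, so that tables derived from A all see the same IDs (the data and column names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").appName("ids").getOrCreate()
import spark.implicits._

// Attach the ID column once, then pin the result in memory.
val a = Seq("u", "v", "w").toDF("value")
  .withColumn("row_id", monotonically_increasing_id())
  .cache()

// Force the cache to be populated before deriving anything,
// so the non-deterministic IDs are computed exactly once.
a.count()

// Both derived tables now read the cached rows, so their row_id
// values for the same source row agree.
val x = a.filter($"value" =!= "u")
val y = a.select($"row_id", $"value")
```

One caveat: cache() is best-effort (cached partitions can be evicted and recomputed), so for a hard guarantee you would checkpoint, or write the ID'd table out and read it back.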
Why dataframe can be more efficient than dataset?
Hi Spark-users,

I came across a few sources which mentioned that a DataFrame can be more efficient than a Dataset. I can understand this is true because a Dataset allows functional transformations which Catalyst cannot look into and hence cannot optimize well. But can a DataFrame be more efficient than a Dataset even if we only use relational transformations on the Dataset? If so, can anyone give some explanation why? Any benchmark comparing Dataset vs. DataFrame? Thank you!

Shiyuan