Re: RDD to DataFrame question with JsValue in the mix
On 7/1/2016 6:42 AM, Akhil Das wrote:
> case class Holder(str: String, js: JsValue)

Hello,

Thanks! I tried that before posting the question to the list, but I keep getting the error below even after the map() operation to convert (String, JsValue) -> Holder and then toDF(). I am simply invoking the following:

    val rddDF: DataFrame = rdd.map(x => Holder(x._1, x._2)).toDF
    rddDF.registerTempTable("rddf")
    rddDF.schema.mkString(",")

And getting the following:

    [2016-07-01 11:57:02,720] WARN .jobserver.JobManagerActor [] [akka://JobServer/user/context-supervisor/test] - Exception from job d4c9d145-92bf-4c64-8904-91c917bd61d3: java.lang.UnsupportedOperationException: Schema for type play.api.libs.json.JsValue is not supported
        at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:718)
        at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:693)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:691)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
        at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
        at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
        at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
        at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
        at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
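A minimal sketch of one way around this, assuming Play JSON and a Spark 1.x SQLContext: Catalyst cannot derive a schema for JsValue, so either store the JSON as a String in the case class, or hand the raw JSON strings to sqlContext.read.json so Spark infers a queryable nested schema. The names mirror the thread; everything else is illustrative.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import play.api.libs.json.{JsValue, Json}

    // Catalyst has no schema for JsValue, so keep the JSON as a String.
    case class Holder(str: String, js: String)

    def toDataFrame(rdd: RDD[(String, JsValue)], sqlContext: SQLContext): DataFrame = {
      import sqlContext.implicits._
      rdd.map { case (k, js) => Holder(k, Json.stringify(js)) }.toDF()
    }

    // Alternatively, let Spark infer a nested schema from the JSON text,
    // which makes the individual JSON fields directly queryable in SQL:
    def toJsonDF(rdd: RDD[(String, JsValue)], sqlContext: SQLContext): DataFrame =
      sqlContext.read.json(rdd.map { case (_, js) => Json.stringify(js) })

With the second variant, registerTempTable plus ordinary SQL over the inferred columns should work as usual.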
RDD to DataFrame question with JsValue in the mix
Hello,

I have an RDD[(String, JsValue)] that I want to convert into a DataFrame and then run SQL on. What is the easiest way to get the JSON (in the form of a JsValue) "understood" by the process?

Thanks!
Re: Silly Question on my part...
On 5/16/2016 12:12 PM, Michael Segel wrote:
> For one use case we were considering using the Thrift server as a way to allow multiple clients to access shared RDDs. Within the Thrift context, we create an RDD and expose it as a Hive table. The question is... where does the RDD exist? On the Thrift service node itself, or is that just a reference to an RDD contained within contexts on the cluster?

You can share RDDs using Apache Ignite - a distributed memory grid/cache with tons of additional functionality. The advantages are extra resilience (you can mirror caches or just partition them), the ability to query the contents of the caches in standard SQL, etc. Since the caches persist past the existence of the Spark app, you can share them (obviously). You also get read/write-through to SQL or NoSQL databases on the back end for persistence and for loading/dumping caches to secondary storage. It is written in Java, so it is very easy to use from Scala/Spark apps.
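For the curious, a minimal sketch of what that sharing looks like with Ignite's Spark integration (the ignite-spark module). The cache name and types are illustrative, and the exact IgniteContext signature has shifted between Ignite versions:

    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.IgniteContext
    import org.apache.spark.SparkContext

    def shareRdd(sc: SparkContext): Unit = {
      // Attach this Spark app to an Ignite grid (default configuration here).
      val ic = new IgniteContext[String, Int](sc, () => new IgniteConfiguration())

      // An IgniteRDD is a live view of the named cache, shared across apps.
      val shared = ic.fromCache("sharedRdd")
      shared.savePairs(sc.parallelize(1 to 5).map(i => (s"key$i", i)))

      // Any other Spark application attached to the same grid sees the same
      // data through fromCache("sharedRdd"), even after this app exits.
    }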
Re: Structured Streaming in Spark 2.0 and DStreams
On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:
> AFAIK, the underlying data represented under the DataSet[T] abstraction will be formatted in Tachyon under the hood, but as with RDDs, it will be spilled to local disk on the worker if needed.

There is another option in the case of RDDs - the Apache Ignite project, a memory grid/distributed cache that supports Spark RDDs. The nice thing about Ignite is that everything is done automatically for you; you can also duplicate caches for resiliency, load caches from disk, partition them, etc., and you get automatic spillover to SQL (and NoSQL) capable backends via read/write-through capabilities. I think there is also an effort to support DataFrames. Ignite supports standard SQL for querying the caches too.

> On Mon, May 16, 2016, 19:47 Benjamin Kim wrote:
>> I have a curiosity question. These forever/unlimited DataFrames/DataSets will persist and be query capable. I still am foggy about how this data will be stored. As far as I know, memory is finite. Will the data be spilled to disk and be retrievable if the query spans data not in memory? Is Tachyon (Alluxio), HDFS (Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL), Object Store (S3, Swift), or anything else I can't think of going to be the underlying near-real-time storage system?
>> Thanks,
>> Ben
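As a rough illustration of the read/write-through point, an Ignite cache is wired to a backing store through its CacheConfiguration; MyJdbcStore below is a hypothetical CacheStore implementation you would write against your own database:

    import javax.cache.Cache
    import javax.cache.configuration.FactoryBuilder
    import org.apache.ignite.cache.store.CacheStoreAdapter
    import org.apache.ignite.configuration.CacheConfiguration

    // Hypothetical store: replace the bodies with real JDBC/NoSQL calls.
    class MyJdbcStore extends CacheStoreAdapter[String, String] {
      override def load(key: String): String = null                           // read-through path
      override def write(e: Cache.Entry[_ <: String, _ <: String]): Unit = () // write-through path
      override def delete(key: Any): Unit = ()
    }

    val cacheCfg = new CacheConfiguration[String, String]("spillover")
    cacheCfg.setReadThrough(true)   // cache misses are loaded from the store
    cacheCfg.setWriteThrough(true)  // cache writes are propagated to the store
    cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(classOf[MyJdbcStore]))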
Re: Apache Spark Slack
On 5/16/2016 9:52 AM, Xinh Huynh wrote:
> I just went to IRC. It looks like the correct channel is #apache-spark. So, is this an "official" chat room for Spark?

Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is an official IRC channel for Spark :-)
Re: Apache Spark Slack
On 5/16/2016 9:30 AM, Paweł Szulc wrote:
> Just realized that people have to be invited to this thing. You see, that's why Gitter is just simpler. I will try to figure it out ASAP.

You don't need invitations to IRC, and it has been around for decades. You can just go to webchat.freenode.net and log in to the #spark channel (or use a CLI-based client). In addition, Gitter is owned by a private entity and it too requires an account - what does it give you that is advantageous? You wanted real-time chat about Spark - IRC has it, and the channel has already been around for a while :-)
Re: Apache Spark Slack
On 5/16/2016 6:40 AM, Paweł Szulc wrote:
> I've just created this https://apache-spark.slack.com for ad-hoc communications within the community. Everybody's welcome!

Why not just IRC? Slack is yet another place to create an account, etc. - IRC is much easier. What does Slack give you that's so very special? :-)
Re: Tracking / estimating job progress
On 5/13/2016 10:39 AM, Anthony May wrote:
> It looks like it might only be available via REST:
> http://spark.apache.org/docs/latest/monitoring.html#rest-api

Nice, thanks!

> On Fri, 13 May 2016 at 11:24 Dood@ODDO <oddodao...@gmail.com> wrote:
>> On 5/13/2016 10:16 AM, Anthony May wrote:
>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
>>> Might be useful
>> How do you use it? You cannot instantiate the class - is the constructor private? Thanks!
>>> On Fri, 13 May 2016 at 11:11 Ted Yu <yuzhih...@gmail.com> wrote:
>>>> Have you looked at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala?
>>>> Cheers
>>>> On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO <oddodao...@gmail.com> wrote:
>>>>> I provide a RESTful API (via Scalatra) for launching Spark jobs - part of the functionality is tracking these jobs. What API is available to track the progress of a particular Spark application? How about estimating where in the total job progress the job is?
>>>>> Thanks!
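For the REST route, a hedged sketch of polling the monitoring endpoint from outside the driver; the host, port, and application id are placeholders, and the real id comes from GET /api/v1/applications:

    import scala.io.Source

    // The /jobs payload includes per-job fields such as numTasks and
    // numCompletedTasks, enough for a rough progress estimate.
    val appId = "app-00000000000000-0000" // placeholder
    val url = s"http://driver-host:4040/api/v1/applications/$appId/jobs"
    val json = Source.fromURL(url).mkString
    println(json) // parse with your JSON library of choice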
Re: Tracking / estimating job progress
On 5/13/2016 10:16 AM, Anthony May wrote:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
> Might be useful

How do you use it? You cannot instantiate the class - is the constructor private?

Thanks!

> On Fri, 13 May 2016 at 11:11 Ted Yu <yuzhih...@gmail.com> wrote:
>> Have you looked at core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala?
>> Cheers
>> On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO <oddodao...@gmail.com> wrote:
>>> I provide a RESTful API (via Scalatra) for launching Spark jobs - part of the functionality is tracking these jobs. What API is available to track the progress of a particular Spark application? How about estimating where in the total job progress the job is?
>>> Thanks!
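To answer the instantiation question with a minimal sketch: SparkStatusTracker is obtained from the SparkContext (sc.statusTracker) rather than constructed directly. Assuming the job was tagged beforehand with sc.setJobGroup (the group name here is illustrative), a rough fraction-complete estimate looks like:

    import org.apache.spark.SparkContext

    // Call sc.setJobGroup("myGroup", "description") before launching the job.
    def progress(sc: SparkContext, jobGroup: String): Double = {
      val tracker = sc.statusTracker // no public constructor; comes from the context
      val stageIds = tracker.getJobIdsForGroup(jobGroup)
        .flatMap(tracker.getJobInfo(_))
        .flatMap(_.stageIds)
      val stages = stageIds.flatMap(id => tracker.getStageInfo(id))
      val total = stages.map(_.numTasks).sum
      if (total == 0) 0.0 else stages.map(_.numCompletedTasks).sum.toDouble / total
    }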
Tracking / estimating job progress
I provide a RESTful API (via Scalatra) for launching Spark jobs - part of the functionality is tracking these jobs. What API is available to track the progress of a particular Spark application? How about estimating where in the total job progress the job is?

Thanks!
Re: Confused - returning RDDs from functions
On 5/12/2016 10:01 PM, Holden Karau wrote:
> This is not the expected behavior, can you maybe post the code where you are running into this?

Hello, thanks for replying! Below is the function I took out of the code.

    def converter(rdd: RDD[(String, JsValue)], param: String): RDD[(String, Int)] = {
      // I am breaking this down for future readability and ease of optimization.
      // As a first attempt at solving this problem, I am not concerned with
      // performance and pretty, more with accuracy ;)

      // r1 will be an RDD containing only the "param" method of selection
      val r1: RDD[(String, JsValue)] = rdd.filter(x =>
        (x._2 \ "field1" \ "field2").as[String].replace("\"", "") == param.replace("\"", ""))

      // r2 will be an RDD of Lists of fields (A1-Z1) with associated counts;
      // remapFields returns a List[(String, Int)]
      val r2: RDD[List[(String, Int)]] = r1.map(x => remapFields(x._2 \ "extra"))

      // r3 will be flattened to enable grouping
      val r3: RDD[(String, Int)] = r2.flatMap(x => x)

      // now we can group by entity
      val r4: RDD[(String, Iterable[(String, Int)])] = r3.groupBy(x => x._1)

      // and produce a mapping of entity -> count pairs
      val r5: RDD[(String, Int)] = r4.map(x => (x._1, x._2.map(y => y._2).sum))

      // return the result
      r5
    }

If I call the above function and run collectAsMap on the returned RDD, I get an empty Map(). If I copy/paste this code into the caller, I get the properly filled-in Map. I am fairly new to Spark and Scala, so excuse any inefficiencies - my priority was to solve the problem in an obvious and correct way and worry about making it pretty later.

Thanks!

On Thursday, May 12, 2016, Dood@ODDO <oddodao...@gmail.com> wrote:
> Hello all,
> I have been programming for years but this has me baffled. I have an RDD[(String, Int)] that I return from a function after extensive manipulation of an initial RDD of a different type. When I return this RDD and initiate the .collectAsMap() on it from the caller, I get an empty Map(). If I copy and paste the code from the function into the caller (same exact code) and produce the same RDD and call collectAsMap() on it, I get the Map with all the expected information in it. What gives? Does Spark defy programming principles or am I crazy? ;-)
> Thanks!
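One common culprit in situations like this - a hedged guess, not a confirmed diagnosis: if converter is a method on a non-serializable enclosing class, the closures capture this, which can behave differently under a job server than the same code pasted inline. Copying arguments into local vals and hosting the function on an object keeps the closures self-contained. The remapFields stub below stands in for the real helper from the thread, and the sketch assumes the Play JSON version used there (where \ returns a JsValue):

    import org.apache.spark.rdd.RDD
    import play.api.libs.json.JsValue

    object Converters {
      // Stub for illustration; the real implementation maps the JSON to (field, count) pairs.
      def remapFields(js: JsValue): List[(String, Int)] = Nil

      def converter(rdd: RDD[(String, JsValue)], param: String): RDD[(String, Int)] = {
        val p = param.replace("\"", "") // local copy: the closure captures `p`, not `this`
        rdd.filter { case (_, js) => (js \ "field1" \ "field2").as[String].replace("\"", "") == p }
          .flatMap { case (_, js) => remapFields(js \ "extra") }
          .reduceByKey(_ + _) // same result as groupBy + summing counts, with less shuffling
      }
    }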
Confused - returning RDDs from functions
Hello all,

I have been programming for years but this has me baffled. I have an RDD[(String, Int)] that I return from a function after extensive manipulation of an initial RDD of a different type. When I return this RDD and initiate the .collectAsMap() on it from the caller, I get an empty Map(). If I copy and paste the code from the function into the caller (same exact code) and produce the same RDD and call collectAsMap() on it, I get the Map with all the expected information in it. What gives? Does Spark defy programming principles or am I crazy? ;-)

Thanks!