Re: RDD to DataFrame question with JsValue in the mix

2016-07-01 Thread Dood

On 7/1/2016 6:42 AM, Akhil Das wrote:

case class Holder(str: String, js:JsValue)


Hello,

Thanks!

I tried that before posting the question to the list, but I keep getting 
an error like the one below even after the map() operation that converts 
(String, JsValue) -> Holder, followed by toDF().


I am simply invoking the following:

val rddDF:DataFrame = rdd.map(x => Holder(x._1,x._2)).toDF
rddDF.registerTempTable("rddf")

rddDF.schema.mkString(",")


And getting the following:

[2016-07-01 11:57:02,720] WARN  .jobserver.JobManagerActor [] 
[akka://JobServer/user/context-supervisor/test] - Exception from job 
d4c9d145-92bf-4c64-8904-91c917bd61d3:
java.lang.UnsupportedOperationException: Schema for type play.api.libs.json.JsValue is not supported
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:718)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:693)
    at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:691)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:691)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
    at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:630)
    at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:30)
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
    at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
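
For what it's worth, Catalyst can only derive schemas for primitives, Strings, Options, 
Seqs, Maps and case classes built from those, which is why a play.api.libs.json.JsValue 
field cannot be mapped. A minimal sketch of one workaround (not tested against this exact 
job; names are illustrative) is to carry the JSON as a String column instead:

import play.api.libs.json.{JsValue, Json}

// defined at the top level, outside the method that calls toDF()
case class Holder(str: String, js: String)

// assuming an rdd: RDD[(String, JsValue)] and a SQLContext named sqlContext
import sqlContext.implicits._

val rddDF = rdd.map { case (key, jsValue) => Holder(key, Json.stringify(jsValue)) }.toDF()
rddDF.registerTempTable("rddf")

// if the JSON fields themselves need to be columns, the strings can instead be
// re-parsed with schema inference, e.g.:
// val jsonDF = sqlContext.read.json(rdd.map { case (_, jsValue) => Json.stringify(jsValue) })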




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RDD to DataFrame question with JsValue in the mix

2016-06-30 Thread Dood

Hello,

I have an RDD[(String,JsValue)] that I want to convert into a DataFrame 
and then run SQL on. What is the easiest way to get the JSON (in form of 
JsValue) "understood" by the process?


Thanks!

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Silly Question on my part...

2016-05-17 Thread Dood

On 5/16/2016 12:12 PM, Michael Segel wrote:

For one use case... we were considering using the thrift server as a way to 
allow multiple clients access to shared RDDs.

Within the Thrift context, we create an RDD and expose it as a Hive table.

The question is… where does the RDD exist? On the Thrift service node itself, 
or is that just a reference to an RDD that is contained within contexts on the 
cluster?



You can share RDDs using Apache Ignite - it is a distributed memory 
grid/cache with tons of additional functionality. The advantage is extra 
resilience: you can mirror caches or just partition them, and you can query 
the contents of the caches with standard SQL. Since the caches persist 
past the existence of the Spark app, you can share them (obviously). You 
also get read/write-through to SQL or NoSQL databases on the back end 
for persistence, and loading/dumping of caches to secondary storage. It is 
written in Java, so it is very easy to use from Scala/Spark apps.
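
For reference, a rough sketch of the shared-RDD path with the ignite-spark module 
(the cache and file names are made up, and the exact IgniteContext constructor 
varies a bit between Ignite releases):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}

// assuming an existing SparkContext `sc`; a default Ignite configuration for brevity
val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

// a live view over the "sharedWords" cache; the cache outlives this Spark application
val sharedRdd: IgniteRDD[String, Int] = igniteContext.fromCache("sharedWords")
sharedRdd.savePairs(sc.textFile("hdfs:///tmp/words.txt").map(w => (w, 1)))

// any other Spark app (or plain Ignite client) on the same cluster sees the same cache;
// caches configured with indexed types can also be queried via sharedRdd.sql(...)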


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Dood

On 5/16/2016 9:53 AM, Yuval Itzchakov wrote:


AFAIK, the underlying data represented under the DataSet[T] 
abstraction will be formatted in Tachyon under the hood, but as with 
RDDs, it will be spilled to local disk on the worker if needed.





There is another option in the case of RDDs - the Apache Ignite project, a 
memory grid/distributed cache that supports Spark RDDs. The nice thing 
about Ignite is that everything is done automatically for you: you can 
duplicate caches for resiliency, load caches from disk, partition them, 
etc., and you get automatic spillover to SQL (and NoSQL) capable backends 
via read/write-through capabilities. I think there is also an effort to 
support DataFrames. Ignite supports standard SQL for querying the caches too.


On Mon, May 16, 2016, 19:47 Benjamin Kim wrote:


I have a curiosity question. These forever/unlimited
DataFrames/DataSets will persist and be query capable. I am still
foggy about how this data will be stored. As far as I know, memory
is finite. Will the data be spilled to disk and be retrievable if
the query spans data not in memory? Is Tachyon (Alluxio), HDFS
(Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL),
Object Store (S3, Swift), or anything else I can't think of going to be
the underlying near real-time storage system?

Thanks,
Ben



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 9:52 AM, Xinh Huynh wrote:

I just went to IRC. It looks like the correct channel is #apache-spark.
So, is this an "official" chat room for Spark?



Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is 
an official channel on IRC for spark :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 9:30 AM, Paweł Szulc wrote:


Just realized that people have to be invited to this thing. You see,  
that's why Gitter is just simpler.


I will try to figure it out ASAP



You don't need invitations to IRC, and it has been around for decades. 
You can just go to webchat.freenode.net and log into the #spark 
channel (or you can use CLI-based clients). In addition, Gitter is owned 
by a private entity and it too requires an account - so what does it give 
you that is advantageous? You wanted real-time chat about Spark - IRC 
has it, and the channel has already been around for a while :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Dood

On 5/16/2016 6:40 AM, Paweł Szulc wrote:
I've just created this https://apache-spark.slack.com for ad-hoc 
communications within the community.


Everybody's welcome!


Why not just IRC? Slack is yet another place to create an account etc. - 
IRC is much easier. What does Slack give you that's so very special? :-)


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Tracking / estimating job progress

2016-05-13 Thread Dood

On 5/13/2016 10:39 AM, Anthony May wrote:
It looks like it might only be available via REST, 
http://spark.apache.org/docs/latest/monitoring.html#rest-api
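
A tiny sketch of pulling the same numbers over that REST API from the driver's UI 
port (4040 by default; the application id below is a placeholder):

import scala.io.Source

// list the applications known to this driver's UI; each entry carries an "id"
val appsJson = Source.fromURL("http://localhost:4040/api/v1/applications").mkString

// per-job progress counters (active/completed/failed task counts, job status, ...) live at
// http://localhost:4040/api/v1/applications/<app-id>/jobs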


Nice, thanks!
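
For reference, the SparkStatusTracker class that came up earlier in the thread can 
also be polled in-process - it is not constructed directly but obtained from the 
running SparkContext. A minimal sketch (the job group name below is made up):

// assuming an existing SparkContext `sc`
sc.setJobGroup("my-rest-job", "job launched via the scalatra endpoint")

// ... kick off the actions to be tracked ...

val tracker = sc.statusTracker
for {
  jobId     <- tracker.getJobIdsForGroup("my-rest-job")
  jobInfo   <- tracker.getJobInfo(jobId)
  stageId   <- jobInfo.stageIds()
  stageInfo <- tracker.getStageInfo(stageId)
} {
  println(s"job $jobId stage $stageId: " +
    s"${stageInfo.numCompletedTasks()}/${stageInfo.numTasks()} tasks done, " +
    s"${stageInfo.numFailedTasks()} failed")
}

The counts are per stage, so a rough overall progress estimate is the completed task 
count summed across stages divided by the total task count summed across stages.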



On Fri, 13 May 2016 at 11:24 Dood@ODDO <oddodao...@gmail.com> wrote:

    On 5/13/2016 10:16 AM, Anthony May wrote:
    >
    > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
    >
    > Might be useful

    How do you use it? You cannot instantiate the class - is the
    constructor private? Thanks!

    >
    > On Fri, 13 May 2016 at 11:11 Ted Yu <yuzhih...@gmail.com> wrote:
    >
    >     Have you looked at
    >     core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?
    >
    >     Cheers
    >
    >     On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO <oddodao...@gmail.com> wrote:
    >
    >         I provide a RESTful API interface from scalatra for launching
    >         Spark jobs - part of the functionality is tracking these jobs.
    >         What API is available to track the progress of a particular
    >         spark application? How about estimating where in the total job
    >         progress the job is?
    >
    >         Thanks!
    >
    >         -
    >         To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
    >         For additional commands, e-mail: user-h...@spark.apache.org


    -
    To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
    For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Tracking / estimating job progress

2016-05-13 Thread Dood

On 5/13/2016 10:16 AM, Anthony May wrote:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker

Might be useful


How do you use it? You cannot instantiate the class - is the constructor 
private? Thanks!




On Fri, 13 May 2016 at 11:11 Ted Yu <yuzhih...@gmail.com> wrote:

    Have you looked at
    core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ?

    Cheers

    On Fri, May 13, 2016 at 10:05 AM, Dood@ODDO <oddodao...@gmail.com> wrote:

        I provide a RESTful API interface from scalatra for launching
        Spark jobs - part of the functionality is tracking these jobs.
        What API is available to track the progress of a particular
        spark application? How about estimating where in the total job
        progress the job is?

        Thanks!

        -
        To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
        For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Tracking / estimating job progress

2016-05-13 Thread Dood
I provide a RESTful API interface from scalatra for launching Spark jobs 
- part of the functionality is tracking these jobs. What API is 
available to track the progress of a particular spark application? How 
about estimating where in the total job progress the job is?


Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Confused - returning RDDs from functions

2016-05-13 Thread Dood

  
  
On 5/12/2016 10:01 PM, Holden Karau wrote:

This is not the expected behavior, can you maybe post the code where
you are running into this?

Hello, thanks for replying!

Below is the function I took out from the code.

def converter(rdd: RDD[(String, JsValue)], param:String): RDD[(String, Int)] = {
  // I am breaking this down for future readability and ease of optimization
  // as a first attempt at solving this problem, I am not concerned with performance
  // and pretty, more with accuracy ;)
  // r1 will be an RDD containing only the "param" method of selection
  val r1:RDD[(String,JsValue)] = rdd.filter(x => (x._2 \ "field1" \ "field2").as[String].replace("\"","") == param.replace("\"",""))
  // r2 will be an RDD of Lists of fields (A1-Z1) with associated counts
  // remapFields returns a List[(String,Int)]
  val r2:RDD[List[(String,Int)]] = r1.map(x => remapFields(x._2 \ "extra"))
  // r3 will be flattened to enable grouping
  val r3:RDD[(String,Int)] = r2.flatMap(x => x)
  // now we can group by entity
  val r4:RDD[(String,Iterable[(String,Int)])] = r3.groupBy(x => x._1)
  // and produce a mapping of entity -> count pairs
  val r5:RDD[(String,Int)] = r4.map(x => (x._1, x._2.map(y => y._2).sum))
  // return the result
  r5
}

If I call the above function and then collectAsMap() on the returned
RDD, I get an empty Map(). If I copy/paste this code into the
caller, I get the properly filled-in Map.

I am fairly new to Spark and Scala, so excuse any inefficiencies - my
priority was to solve the problem in an obvious and correct way and
worry about making it pretty later.

Thanks!

On Thursday, May 12, 2016, Dood@ODDO <oddodao...@gmail.com> wrote:

    Hello all,

    I have been programming for years but this has me baffled.

    I have an RDD[(String,Int)] that I return from a function after
    extensive manipulation of an initial RDD of a different type.
    When I return this RDD and initiate the .collectAsMap() on it
    from the caller, I get an empty Map().

    If I copy and paste the code from the function into the caller
    (same exact code) and produce the same RDD and call
    collectAsMap() on it, I get the Map with all the expected
    information in it.

    What gives?

    Does Spark defy programming principles or am I crazy? ;-)

    Thanks!

    -
    To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
    For additional commands, e-mail: user-h...@spark.apache.org


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Confused - returning RDDs from functions

2016-05-12 Thread Dood

Hello all,

I have been programming for years but this has me baffled.

I have an RDD[(String,Int)] that I return from a function after 
extensive manipulation of an initial RDD of a different type. When I 
return this RDD and initiate the .collectAsMap() on it from the caller, 
I get an empty Map().


If I copy and paste the code from the function into the caller (same 
exact code) and produce the same RDD and call collectAsMap() on it, I 
get the Map with all the expected information in it.


What gives?

Does Spark defy programming principles or am I crazy? ;-)

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org