Spark streaming get RDD within the sliding window

2016-08-24 Thread Ulanov, Alexander
Dear Spark developers,

I am working with Spark Streaming 1.6.1. The task is to get an RDD from each time 
window and pass it to some external analytics. This external function accepts an 
RDD, so I cannot use a DStream directly. I learned that DStream.window.compute(time) 
returns Option[RDD]. I am trying to use it in the following code, derived from the 
example in the programming guide:
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", )

val rdd = lines.window(Seconds(5), Seconds(3)).compute(Time(System.currentTimeMillis())) // this does not seem to be a proper way to set time

ssc.start()

ssc.awaitTermination()

At the line that defines rdd I get the following exception:

Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.SocketInputDStream@2264e43c has not been initialized.

The other option to get an RDD from a DStream is the "slice" function. However, 
it is not clear how to use it, and I get the same exception with the following 
usage:

val rdd = lines.slice(Time(System.currentTimeMillis() - 100), Time(System.currentTimeMillis())) // does not seem correct

Could you suggest the proper use of the "compute" or "slice" functions of DStream, 
or another way to get an RDD from a DStream?

Best regards, Alexander

P.S. I have found the following example that does streaming within a loop; however, 
it looks hacky:
https://github.com/chlam4/spark-exercises/blob/master/using-dstream-slice.scala
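
For reference, the usual way to hand each window's data to an RDD-based function is to let the streaming scheduler drive it via foreachRDD on the windowed stream, rather than calling compute or slice before ssc.start(). A minimal sketch (the port number and the analytics function below are placeholders, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for the external analytics function that accepts an RDD.
def runExternalAnalytics(rdd: RDD[String]): Unit = println(s"count = ${rdd.count()}")

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // placeholder port

// Every 3 seconds the scheduler hands over the RDD covering the last 5 seconds.
lines.window(Seconds(5), Seconds(3)).foreachRDD { rdd =>
  runExternalAnalytics(rdd)
}

ssc.start()
ssc.awaitTermination()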


StateStore with DStreams

2016-08-24 Thread Matt Smith
Are there any examples of how to use StateStore with DStreams?  It seems
like the idea would be to create a new version with each minibatch, but I
don't quite know how to make that happen.  My lame attempt is below.

  def run(ss: SparkSession): Unit = {
    val c = new StreamingContext(ss.sparkContext, Seconds(2))
    val stm = c.socketTextStream("localhost", )

    var version = 0L
    stm.foreachRDD { (rdd, t) =>
      val data = rdd
        .map { (s) =>
          val Array(k, v) = s.split(" ")
          (k, v)
        }
        .mapPartitionsWithStateStore(ss.sqlContext, "/Users/msmith/cp", 1,
          version, keySchema, valueSchema) { (store, rows) =>
          val data = rows.map { case (k, v) =>
            val keyRow = InternalRow(UTF8String.fromString(k))
            val keyURow = UnsafeProjection.create(keySchema).apply(keyRow)

            val newCount = v.toLong
            val count = store.get(keyURow).map(_.getLong(0)).getOrElse(0L)
            val valRow = InternalRow(count + newCount)
            val valURow = UnsafeProjection.create(valueSchema).apply(valRow)
            store.put(keyURow, valURow)

            val ret = (k, count + newCount)
            println("ret", ret)
            ret
          }
          lazy val finish = Some(("", 0)).flatMap { case (k, v) =>
            println("finish")
            version = store.commit()
            println("commit", version)
            None
          }

          data ++ finish
        }

      println(data.collectAsMap())
    }

    c.start()            // Start the computation
    c.awaitTermination() // Wait for the computation to terminate
  }


Re: Tree for SQL Query

2016-08-24 Thread Reynold Xin
It's basically the output of the explain command.
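
For example, in a 2.0 spark-shell (the query here is just an illustration):

val df = spark.range(10).selectExpr("id AS x").selectExpr("x + 1 + 2 AS y")

// Prints the parsed, analyzed, optimized and physical plans in one go.
df.explain(true)

// Or grab the individual trees programmatically:
println(df.queryExecution.logical)       // parsed logical plan
println(df.queryExecution.analyzed)      // analyzed logical plan
println(df.queryExecution.optimizedPlan) // after Catalyst optimization
println(df.queryExecution.executedPlan)  // physical plan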


On Wed, Aug 24, 2016 at 12:31 PM, Maciej Bryński  wrote:

> Hi,
> I read this article:
> https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
>
> And I have a question. Is it possible to get / print Tree for SQL Query ?
>
> Something like this:
>
> Add(Attribute(x), Add(Literal(1), Literal(2)))
>
> Regards,
> --
> Maciek Bryński
>
>
>


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
If you're just varying versions (or things that can be controlled by a
profile, which is most everything including dependencies), you don't
need and probably don't want multiple POM files. Even that wouldn't
mean you can't use classifiers.

I have seen it used for HBase, core Hadoop. I am not sure I've seen it
used for Spark 2 vs 1 but no reason it couldn't be. Frequently
projects would instead declare that as of some version, Spark 2 is
required, rather than support both. Or shim over an API difference
with reflection if that's all there was to it. Spark does both of
those sorts of things itself to avoid having to publish multiple
variants at all. (Well, except for Scala 2.10 vs 2.11!)
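
For what it's worth, from the consuming build's side the two approaches look roughly like this in sbt (the ADAM coordinates from the thread are reused purely for illustration; the version number is made up):

// Option A: one artifact name, a classifier distinguishes the Spark 2.x build.
libraryDependencies += "org.bdgenomics.adam" %% "adam-core" % "0.20.0" classifier "spark2"

// Option B: the Spark major version is encoded in the artifact name itself.
libraryDependencies += "org.bdgenomics.adam" %% "adam-core-spark2" % "0.20.0"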

On Wed, Aug 24, 2016 at 6:02 PM, Michael Heuer  wrote:
> Have you seen any successful applications of this for Spark 1.x/2.x?
>
> From the doc "The classifier allows to distinguish artifacts that were built
> from the same POM but differ in their content."
>
> We'd be building from different POMs, since we'd be modifying the Spark
> dependency version (and presumably any other dependencies that needed the
> same Spark 1.x/2.x distinction).
>
>
> On Wed, Aug 24, 2016 at 11:49 AM, Sean Owen  wrote:
>>
>> This is also what "classifiers" are for in Maven, to have variations
>> on one artifact and version. https://maven.apache.org/pom.html
>>
>> It has been used to ship code for Hadoop 1 vs 2 APIs.
>>
>> In a way it's the same idea as Scala's "_2.xx" naming convention, with
>> a less unfortunate implementation.
>>
>>
>> On Wed, Aug 24, 2016 at 5:41 PM, Michael Heuer  wrote:
>> > Hello,
>> >
>> > We're a project downstream of Spark and need to provide separate
>> > artifacts
>> > for Spark 1.x and Spark 2.x.  Has any convention been established or
>> > even
>> > proposed for artifact names and/or qualifiers?
>> >
>> > We are currently thinking
>> >
>> > org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
>> > Scala
>> > 2.10 & 2.11
>> >
>> >   and
>> >
>> > org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x
>> > and
>> > Scala 2.10 & 2.11
>> >
>> > https://github.com/bigdatagenomics/adam/issues/1093
>> >
>> >
>> > Thanks in advance,
>> >
>> >michael
>
>




Re: GraphFrames 0.2.0 released

2016-08-24 Thread Maciej Bryński
Hi,
Do you plan to add tag for this release on github ?
https://github.com/graphframes/graphframes/releases

Regards,
Maciek

2016-08-17 3:18 GMT+02:00 Jacek Laskowski :

> Hi Tim,
>
> AWESOME. Thanks a lot for releasing it. That makes me even more eager
> to see it in Spark's codebase (and replacing the current RDD-based
> API)!
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter 
> wrote:
> > Hello all,
> > I have released version 0.2.0 of the GraphFrames package. Apart from a
> > few bug fixes, it is the first release published for Spark 2.0 and both
> > Scala 2.10 and 2.11. Please let us know if you have any comment or
> > questions.
> >
> > It is available as a Spark package:
> > https://spark-packages.org/package/graphframes/graphframes
> >
> > The source code is available as always at
> > https://github.com/graphframes/graphframes
> >
> >
> > What is GraphFrames?
> >
> > GraphFrames is a DataFrame-based graph engine for Spark. In addition to
> > the algorithms available in GraphX, users can write highly expressive
> > queries by leveraging the DataFrame API, combined with a new API for
> > motif finding. The user also benefits from DataFrame performance
> > optimizations within the Spark SQL engine.
> >
> > Cheers
> >
> > Tim
> >
> >
> >
>
>
>


-- 
Maciek Bryński
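
For anyone wanting to try it, a minimal sketch of the DataFrame-based API described above (the graph data is made up; column names follow the GraphFrames id/src/dst convention, and the session setup is assumed):

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder().master("local[2]").appName("gf-demo").getOrCreate()

val vertices = spark.createDataFrame(Seq(
  ("a", "Alice"), ("b", "Bob"), ("c", "Carol")
)).toDF("id", "name")

val edges = spark.createDataFrame(Seq(
  ("a", "b", "friend"), ("b", "c", "follow"), ("c", "a", "follow")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

g.inDegrees.show()      // a GraphX-style metric, returned as a DataFrame
g.find("(x)-[e]->(y)")  // motif finding
  .filter("e.relationship = 'follow'")
  .show()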


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Have you seen any successful applications of this for Spark 1.x/2.x?

From the doc "The classifier allows to distinguish artifacts that were
built from the same POM but differ in their content."

We'd be building from different POMs, since we'd be modifying the Spark
dependency version (and presumably any other dependencies that needed the
same Spark 1.x/2.x distinction).


On Wed, Aug 24, 2016 at 11:49 AM, Sean Owen  wrote:

> This is also what "classifiers" are for in Maven, to have variations
> on one artifact and version. https://maven.apache.org/pom.html
>
> It has been used to ship code for Hadoop 1 vs 2 APIs.
>
> In a way it's the same idea as Scala's "_2.xx" naming convention, with
> a less unfortunate implementation.
>
>
> On Wed, Aug 24, 2016 at 5:41 PM, Michael Heuer  wrote:
> > Hello,
> >
> > We're a project downstream of Spark and need to provide separate
> artifacts
> > for Spark 1.x and Spark 2.x.  Has any convention been established or even
> > proposed for artifact names and/or qualifiers?
> >
> > We are currently thinking
> >
> > org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
> Scala
> > 2.10 & 2.11
> >
> >   and
> >
> > org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x
> and
> > Scala 2.10 & 2.11
> >
> > https://github.com/bigdatagenomics/adam/issues/1093
> >
> >
> > Thanks in advance,
> >
> >michael
>


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Sean Owen
This is also what "classifiers" are for in Maven, to have variations
on one artifact and version. https://maven.apache.org/pom.html

It has been used to ship code for Hadoop 1 vs 2 APIs.

In a way it's the same idea as Scala's "_2.xx" naming convention, with
a less unfortunate implementation.


On Wed, Aug 24, 2016 at 5:41 PM, Michael Heuer  wrote:
> Hello,
>
> We're a project downstream of Spark and need to provide separate artifacts
> for Spark 1.x and Spark 2.x.  Has any convention been established or even
> proposed for artifact names and/or qualifiers?
>
> We are currently thinking
>
> org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and Scala
> 2.10 & 2.11
>
>   and
>
> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x and
> Scala 2.10 & 2.11
>
> https://github.com/bigdatagenomics/adam/issues/1093
>
>
> Thanks in advance,
>
>michael




Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-08-24 Thread Reynold Xin
Looks like in general people like it. Next step is for somebody to take
the lead and implement it.

Tom do you have cycles to do this?
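
For concreteness, a purely hypothetical Scala sketch of what the two orthogonal labels discussed in this thread could look like; none of these names exist in Spark, they are lifted from the discussion quoted below:

// Hypothetical sketch only; names come from the thread, not from any committed Spark code.
package org.apache.spark.annotation

import scala.annotation.StaticAnnotation

// Who the API is aimed at.
sealed trait Audience
case object Application extends Audience // end-user applications
case object Library     extends Audience // libraries built on top of Spark

// How much the API may change between releases.
sealed trait Stability
case object Stable       extends Stability
case object Evolving     extends Stability
case object Experimental extends Stability

// Two orthogonal labels instead of a single @DeveloperApi-style annotation.
class InterfaceAudience(val audience: Audience) extends StaticAnnotation
class InterfaceStability(val level: Stability)  extends StaticAnnotation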

On Wednesday, August 24, 2016, Tom Graves  wrote:

> ping, did this discussion conclude or did we decide what we are doing?
>
> Tom
>
>
> On Friday, May 13, 2016 3:19 PM, Michael Armbrust wrote:
>
>
> +1 to the general structure of Reynold's proposal.  I've found what we do
> currently a little confusing.  In particular, it doesn't make much sense
> that @DeveloperApi things are always labeled as possibly changing.  For
> example the Data Source API should arguably be one of the most stable
> interfaces since its very difficult for users to recompile libraries that
> might break when there are changes.
>
> For a similar reason, I don't really see the point of LimitedPrivate.
> The goal here should be communication of promises of stability or future
> stability.
>
> Regarding Developer vs. Public. I don't care too much about the naming,
> but it does seem useful to differentiate APIs that we expect end users to
> consume from those that are used to augment Spark. "Library" and
> "Application" also seem reasonable.
>
> On Fri, May 13, 2016 at 11:15 AM, Marcelo Vanzin wrote:
>
> On Fri, May 13, 2016 at 10:18 AM, Sean Busbey wrote:
> > I think LimitedPrivate gets a bad rap due to the way it is misused in
> > Hadoop. The use case here -- "we offer this to developers of
> > intermediate layers; those willing to update their software as we
> > update ours"
>
> I think "LimitedPrivate" is a rather confusing name for that. I think
> Reynold's first e-mail better matches that use case: this would be
> "InterfaceAudience(Developer)" and "InterfaceStability(Experimental)".
>
> But I don't really like "Developer" as a name here, because it's
> ambiguous. Developer of what? Theoretically everybody writing Spark or
> on top of its APIs is a developer. In that sense, I prefer using
> something like "Library" and "Application" instead of "Developer" and
> "Public".
>
> Personally, in fact, I don't see a lot of gain in differentiating
> between the target users of an interface... knowing whether it's a
> stable interface or not is a lot more useful. If you're equating a
> "developer API" with "it's not really stable", then you don't really
> need two annotations for that - just say it's not stable.
>
> --
> Marcelo
>
>
>
>
>
>


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Ah yes, thank you for the clarification.

On Wed, Aug 24, 2016 at 11:44 AM, Ted Yu  wrote:

> 'Spark 1.x and Scala 2.10 & 2.11' was repeated.
>
> I guess your second line should read:
>
> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 2.x
> and Scala 2.10 & 2.11
>
> On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer  wrote:
>
>> Hello,
>>
>> We're a project downstream of Spark and need to provide separate
>> artifacts for Spark 1.x and Spark 2.x.  Has any convention been established
>> or even proposed for artifact names and/or qualifiers?
>>
>> We are currently thinking
>>
>> org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
>> Scala 2.10 & 2.11
>>
>>   and
>>
>> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x
>> and Scala 2.10 & 2.11
>>
>> https://github.com/bigdatagenomics/adam/issues/1093
>>
>>
>> Thanks in advance,
>>
>>michael
>>
>
>


Re: Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Ted Yu
'Spark 1.x and Scala 2.10 & 2.11' was repeated.

I guess your second line should read:

org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 2.x and
Scala 2.10 & 2.11

On Wed, Aug 24, 2016 at 9:41 AM, Michael Heuer  wrote:

> Hello,
>
> We're a project downstream of Spark and need to provide separate artifacts
> for Spark 1.x and Spark 2.x.  Has any convention been established or even
> proposed for artifact names and/or qualifiers?
>
> We are currently thinking
>
> org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and
> Scala 2.10 & 2.11
>
>   and
>
> org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x
> and Scala 2.10 & 2.11
>
> https://github.com/bigdatagenomics/adam/issues/1093
>
>
> Thanks in advance,
>
>michael
>


Spark 1.x/2.x qualifiers in downstream artifact names

2016-08-24 Thread Michael Heuer
Hello,

We're a project downstream of Spark and need to provide separate artifacts
for Spark 1.x and Spark 2.x.  Has any convention been established or even
proposed for artifact names and/or qualifiers?

We are currently thinking

org.bdgenomics.adam:adam-{core,apis,cli}_2.1[0,1]  for Spark 1.x and Scala
2.10 & 2.11

  and

org.bdgenomics.adam:adam-{core,apis,cli}-spark2_2.1[0,1]  for Spark 1.x and
Scala 2.10 & 2.11

https://github.com/bigdatagenomics/adam/issues/1093


Thanks in advance,

   michael


Re: Anyone else having trouble with replicated off heap RDD persistence?

2016-08-24 Thread Michael Allman
FYI, I've updated the issue's description to include a very simple program 
which reproduces the issue for me.

Cheers,

Michael
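
For context, a bare-bones sketch of the kind of program being described — not the actual repro attached to the JIRA; the data and the off-heap sizing are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("offheap-replicated-repro")
  .config("spark.memory.offHeap.enabled", "true") // off-heap storage must be enabled
  .config("spark.memory.offHeap.size", "1g")
  .getOrCreate()

// The storage level from the report: off-heap, serialized, replicated twice.
val level = StorageLevel(useDisk = true, useMemory = true, useOffHeap = true,
  deserialized = false, replication = 2)

val rdd = spark.sparkContext.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
rdd.persist(level)
println(rdd.values.sum()) // forces materialization of the replicated off-heap blocks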

> On Aug 23, 2016, at 4:54 PM, Michael Allman  wrote:
> 
> I've replied on the issue's page, but in a word, "yes". See 
> https://issues.apache.org/jira/browse/SPARK-17204.
> 
> Michael
> 
> 
>> On Aug 23, 2016, at 11:55 AM, Reynold Xin wrote:
>> 
>> Does this problem still exist on today's master/branch-2.0? 
>> 
>> SPARK-16550 was merged. It might be fixed already.
>> 
>> On Tue, Aug 23, 2016 at 9:37 AM, Michael Allman wrote:
>> FYI, I posted this to user@ and have followed up with a bug report: 
>> https://issues.apache.org/jira/browse/SPARK-17204 
>> 
>> 
>> Michael
>> 
>>> Begin forwarded message:
>>> 
>>> From: Michael Allman
>>> Subject: Anyone else having trouble with replicated off heap RDD 
>>> persistence?
>>> Date: August 16, 2016 at 3:45:14 PM PDT
>>> To: user
>>> 
>>> Hello,
>>> 
>>> A coworker was having a problem with a big Spark job failing after several 
>>> hours when one of the executors would segfault. That problem aside, I 
>>> speculated that her job would be more robust against these kinds of 
>>> executor crashes if she used replicated RDD storage. She's using off heap 
>>> storage (for good reason), so I asked her to try running her job with the 
>>> following storage level: `StorageLevel(useDisk = true, useMemory = true, 
>>> useOffHeap = true, deserialized = false, replication = 2)`. The job would 
>>> immediately fail with a rather suspicious looking exception. For example:
>>> 
>>> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
>>> 9086
>>> at 
>>> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>>> at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>>> at 
>>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>>> at 
>>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>>> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>>> at 
>>> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>>> at 
>>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>> at 
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>>>  Source)
>>> at 
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>>>  Source)
>>> at 
>>> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>>>  Source)
>>> at 
>>> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>> at 
>>> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>>> at 
>>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>>> at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> 
>>> or
>>> 
>>> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>>> at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>>> at java.util.ArrayList.get(ArrayList.java:429)
>>> at 
>>> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>>> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>>> at 
>>> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>>> at 
>>> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>>> at 

Re: [discuss] separate API annotation into two components: InterfaceAudience & InterfaceStability

2016-08-24 Thread Tom Graves
ping, did this discussion conclude or did we decide what we are doing?
Tom 

On Friday, May 13, 2016 3:19 PM, Michael Armbrust wrote:

+1 to the general structure of Reynold's proposal.  I've found what we do
currently a little confusing.  In particular, it doesn't make much sense that 
@DeveloperApi things are always labeled as possibly changing.  For example the 
Data Source API should arguably be one of the most stable interfaces since its 
very difficult for users to recompile libraries that might break when there are 
changes.
For a similar reason, I don't really see the point of LimitedPrivate.  The goal 
here should be communication of promises of stability or future stability.
Regarding Developer vs. Public. I don't care too much about the naming, but it 
does seem useful to differentiate APIs that we expect end users to consume from 
those that are used to augment Spark. "Library" and "Application" also seem 
reasonable.
On Fri, May 13, 2016 at 11:15 AM, Marcelo Vanzin  wrote:

On Fri, May 13, 2016 at 10:18 AM, Sean Busbey  wrote:
> I think LimitedPrivate gets a bad rap due to the way it is misused in
> Hadoop. The use case here -- "we offer this to developers of
> intermediate layers; those willing to update their software as we
> update ours"

I think "LimitedPrivate" is a rather confusing name for that. I think
Reynold's first e-mail better matches that use case: this would be
"InterfaceAudience(Developer)" and "InterfaceStability(Experimental)".

But I don't really like "Developer" as a name here, because it's
ambiguous. Developer of what? Theoretically everybody writing Spark or
on top of its APIs is a developer. In that sense, I prefer using
something like "Library" and "Application" instead of "Developer" and
"Public".

Personally, in fact, I don't see a lot of gain in differentiating
between the target users of an interface... knowing whether it's a
stable interface or not is a lot more useful. If you're equating a
"developer API" with "it's not really stable", then you don't really
need two annotations for that - just say it's not stable.

--
Marcelo






  

Re: Spark dev-setup

2016-08-24 Thread Jacek Laskowski
On Wed, Aug 24, 2016 at 2:32 PM, Steve Loughran  wrote:

> no reason; the key thing is : not in cluster mode, as there your work happens 
> elsewhere

Right! Anything but cluster mode should make it easy (that leaves us
with local).

Jacek




Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread Daniel Darabos
You are saying the RDD lineage must be serialized, otherwise we could not
recreate it after a node failure. This is false. The RDD lineage is not
serialized. It is only relevant to the driver application and as such it is
just kept in memory in the driver application. If the driver application
stops, the lineage is lost. There is no recovery.
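
One easy way to see that the lineage is just driver-side metadata is to print it from the driver (a local spark-shell session is assumed here):

val rdd = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

// The lineage (the chain of dependencies) lives in the driver and can be
// inspected without running anything on the executors.
println(rdd.toDebugString)
// (2) MapPartitionsRDD[2] at filter ...
//  |  MapPartitionsRDD[1] at map ...
//  |  ParallelCollectionRDD[0] at parallelize ...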

On Wed, Aug 24, 2016 at 10:20 AM, kant kodali  wrote:

> can you please elaborate a bit more?
>
>
>
> On Wed, Aug 24, 2016 12:41 AM, Sean Owen so...@cloudera.com wrote:
>
>> Byte code, no. It's sufficient to store the information that the RDD
>> represents, which can include serialized function closures, but that's not
>> quite storing byte code.
>>
>> On Wed, Aug 24, 2016 at 2:00 AM, kant kodali  wrote:
>>
>> Hi Guys,
>>
>> I have this question for a very long time and after diving into the
>> source code(specifically from the links below) I have a feeling that the
>> lineage of an RDD (the transformations) are converted into byte code and
>> stored in memory or disk. or if I were to ask another question on a similar
>> note do we ever store JVM byte code or python byte code in memory or disk?
>> This make sense to me because if we were to construct an RDD after a node
>> failure we need to go through the lineage and execute the respective
>> transformations so storing their byte codes does make sense however many
>> people seem to disagree with me so it would be great if someone can clarify.
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
>>
>> https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
>>
>>
>>


Re: Spark dev-setup

2016-08-24 Thread Steve Loughran

> On 24 Aug 2016, at 11:38, Jacek Laskowski  wrote:
> 
> On Wed, Aug 24, 2016 at 11:13 AM, Steve Loughran  
> wrote:
> 
>> I'd recommend
> 
> ...which I mostly agree to with some exceptions :)
> 
>> -stark spark standalone from there
> 
> Why spark standalone since the OP asked about "learning how query
> execution flow occurs in Spark SQL"? How about spark-shell in local
> mode? Possibly explain(true) + conf/log4j.properties as the code might
> get tricky to get right at the very beginning.
> 


no reason; the key thing is : not in cluster mode, as there your work happens 
elsewhere





Re: Spark dev-setup

2016-08-24 Thread Jacek Laskowski
On Wed, Aug 24, 2016 at 11:13 AM, Steve Loughran  wrote:

> I'd recommend

...which I mostly agree to with some exceptions :)

> -stark spark standalone from there

Why spark standalone since the OP asked about "learning how query
execution flow occurs in Spark SQL"? How about spark-shell in local
mode? Possibly explain(true) + conf/log4j.properties as the code might
get tricky to get right at the very beginning.

#justcurious

Jacek




Re: Spark dev-setup

2016-08-24 Thread Steve Loughran

On 24 Aug 2016, at 07:10, Nishadi Kirielle wrote:

Hi,
I'm engaged in learning how query execution flow occurs in Spark SQL. In order 
to understand the query execution flow, I'm attempting to run an example in 
debug mode with intellij IDEA. It would be great if anyone can help me with 
debug configurations.

I'd recommend

-check out the version of spark you want to use
-set breakpoints wherever you want
-use dev/make-distribution.sh to build the release in dist/
-start spark standalone from there
-attach to it in the IDE debugger

submit the work/type in queries in the REPL

this gives you the full launch with the complete classpath and env setup.

Otherwise: pull it out into a JUnit test and try to use IDEA's test runner to 
run it.
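
A rough sketch of that last option, as a self-contained ScalaTest suite one can step through in the IDE (ScalaTest, a local master and the toy query are assumptions here):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class QueryExecutionFlowSuite extends FunSuite {

  test("follow a simple query through the SQL execution flow") {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("sql-flow-debug")
      .getOrCreate()
    try {
      import spark.implicits._
      val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("x", "s").filter($"x" > 1)

      // Put breakpoints in the analyzer/optimizer/planner, or just look at the trees:
      df.explain(true)
      assert(df.count() == 2)
    } finally {
      spark.stop()
    }
  }
}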


Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread kant kodali

can you please elaborate a bit more?





On Wed, Aug 24, 2016 12:41 AM, Sean Owen so...@cloudera.com wrote:
Byte code, no. It's sufficient to store the information that the RDD represents,
which can include serialized function closures, but that's not quite storing
byte code.
On Wed, Aug 24, 2016 at 2:00 AM, kant kodali < kanth...@gmail.com > wrote:
Hi Guys,
I have this question for a very long time and after diving into the source
code(specifically from the links below) I have a feeling that the lineage of an
RDD (the transformations) are converted into byte code and stored in memory or
disk. or if I were to ask another question on a similar note do we ever store
JVM byte code or python byte code in memory or disk? This make sense to me
because if we were to construct an RDD after a node failure we need to go
through the lineage and execute the respective transformations so storing their
byte codes does make sense however many people seem to disagree with me so it
would be great if someone can clarify.
https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452
https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471
https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241

Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread Sean Owen
Byte code, no. It's sufficient to store the information that the RDD
represents, which can include serialized function closures, but that's not
quite storing byte code.

On Wed, Aug 24, 2016 at 2:00 AM, kant kodali  wrote:

> Hi Guys,
>
> I have this question for a very long time and after diving into the source
> code(specifically from the links below) I have a feeling that the lineage
> of an RDD (the transformations) are converted into byte code and stored in
> memory or disk. or if I were to ask another question on a similar note do
> we ever store JVM byte code or python byte code in memory or disk? This
> make sense to me because if we were to construct an RDD after a node
> failure we need to go through the lineage and execute the respective
> transformations so storing their byte codes does make sense however many
> people seem to disagree with me so it would be great if someone can clarify.
>
> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452
>
> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471
>
> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
>
> https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
>


Re: Spark dev-setup

2016-08-24 Thread Nishadi Kirielle
Hi,
I'm engaged in learning how query execution flow occurs in Spark SQL. In
order to understand the query execution flow, I'm attempting to run an
example in debug mode with intellij IDEA. It would be great if anyone can
help me with debug configurations.

Thanks & Regards
Nishadi

On Tue, Jun 21, 2016 at 4:49 PM, Akhil Das  wrote:

> You can read this documentation to get started with the setup
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>
> There was a pyspark setup discussion on SO over here
> http://stackoverflow.com/questions/33478218/write-and-run-pyspark-in-intellij-idea
>
> On Mon, Jun 20, 2016 at 7:23 PM, Amit Rana 
> wrote:
>
>> Hi all,
>>
>> I am interested  in figuring out how pyspark works at core/internal
>> level. And  would like to understand the code flow as well.
>> For that I need to run a simple  example  in debug mode so that I can
>> trace the data flow for pyspark.
>> Can anyone please guide me on how do I set up my development environment
>> for the same in intellij IDEA in Windows 7.
>>
>> Thanks,
>> Amit Rana
>>
>
>
>
> --
> Cheers!
>
>