Re: structured streaming join of streaming dataframe with static dataframe performance

2022-08-04 Thread kant kodali
I suspect it is because the incoming rows, when joined with the static 
frame, can lead to a variable degree of skewness over time, and if so it is 
probably better to employ different join strategies at run time. But if you 
know your dataset, I believe you can just do a broadcast join in your case! 
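
A minimal sketch of that broadcast-join idea in Java (the frame and column
names are illustrative; I'm assuming the streaming and static frames are
already defined):

import static org.apache.spark.sql.functions.broadcast;

// streamingDf: the streaming side (e.g. spark.readStream()...load())
// staticDf:    the static lookup side (e.g. spark.read()...load())
// Wrapping the static side in broadcast(...) hints Spark to ship it to every
// executor once instead of shuffling it for the join on each micro-batch.
Dataset<Row> joined = streamingDf.join(broadcast(staticDf), "join_key");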

It's been a while since I used Spark, so you might want to wait for a more 
authoritative response.

Sent from my iPhone

> On Jul 17, 2022, at 5:38 PM, Koert Kuipers  wrote:
> 
> 
> i was surprised to find out that if a streaming dataframe is joined with a 
> static dataframe, that the static dataframe is re-shuffled for every 
> microbatch, which adds considerable overhead.
> 
> wouldn't it make more sense to re-use the shuffle files?
> 
> or if that is not possible then load the static dataframe into the 
> statestore? this would turn the join into a lookup (in rocksdb)?
> 
> 


Re: https://spark-project.atlassian.net/browse/SPARK-1153

2020-02-24 Thread kant kodali
Sorry, please ignore this. I accidentally ran it with GraphX instead of
GraphFrames.

I see the code here:
https://github.com/graphframes/graphframes/blob/a30adaf53dece8c548d96c895ac330ecb3931451/src/main/scala/org/graphframes/GraphFrame.scala#L539-L555
which indeed generates its own id! That's great!

Thanks

On Sun, Feb 23, 2020 at 3:53 PM kant kodali  wrote:

> Hi All,
>
> Any chance of fixing this one ?
> https://spark-project.atlassian.net/browse/SPARK-1153 or offer some work
> around may be?
>
> Currently, I got bunch of events streaming into kafka across various
> topics and they are stamped with an UUIDv1 for each event. so it is easy to
> construct edges using UUID. I am not quite sure how to generate a long
> based unique id without synchronization in a distributed setting. I had
> read this SO post
> <https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>
>  which
> shows there are two ways one may be able to achieve this
>
> 1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE
>
> 2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~
> 9223372036854251520L)
>
> However I am concerned about collisions and looking for the probability of
> collisions for the above two approaches. any suggestions?
>
> I ran the Connected Components algorithms using graphframes it runs well
> when long based id's are used but with string the performance drops
> significantly as pointed out in the ticket. I understand that algorithm
> depends on hashing integers heavily but I wonder why not fixed length
> byte[] ? that way we can convert any datatype to sequence of bytes.
>
> Thanks!
>


Re: SparkGraph review process

2020-02-23 Thread kant kodali
Hi Sean,

In that case, can we have GraphFrames as part of the Spark release? A separate
release is also fine. Currently, I don't see any releases of GraphFrames.

Thanks


On Fri, Feb 14, 2020 at 9:06 AM Sean Owen  wrote:

> This will not be Spark 3.0, no.
>
> On Fri, Feb 14, 2020 at 1:12 AM kant kodali  wrote:
> >
> > any update on this? Is spark graph going to make it into Spark or no?
> >
> > On Mon, Oct 14, 2019 at 12:26 PM Holden Karau 
> wrote:
> >>
> >> Maybe let’s ask the folks from Lightbend who helped with the previous
> scala upgrade for their thoughts?
> >>
> >> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li  wrote:
> >>>>
> >>>> 1. On the technical side, my main concern is the runtime dependency
> on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
> came out with the solution to shade a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable for two reasons: a) there exists no proper shading libraries
> for Scala, 2) We will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts can help review the current implementation and
> help assess the risk.
> >>>
> >>>
> >>> This concern is valid. I think we should start the vote to ensure the
> whole community is aware of the risk and take the responsibility to
> maintain this in the long term.
> >>>
> >>> Cheers,
> >>>
> >>> Xiao
> >>>
> >>>
> >>> Xiangrui Meng  于2019年10月4日周五 下午12:27写道:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I want to clarify my role first to avoid misunderstanding. I'm an
> individual contributor here. My work on the graph SPIP as well as other
> Spark features I contributed to are not associated with my employer. It
> became quite challenging for me to keep track of the graph SPIP work due to
> less available time at home.
> >>>>
> >>>> On retrospective, we should have involved more Spark devs and
> committers early on so there is no single point of failure, i.e., me.
> Hopefully it is not too late to fix. I summarize my thoughts here to help
> onboard other reviewers:
> >>>>
> >>>> 1. On the technical side, my main concern is the runtime dependency
> on org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
> came out with the solution to shade a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable for two reasons: a) there exists no proper shading libraries
> for Scala, 2) We will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts can help review the current implementation and
> help assess the risk.
> >>>>
> >>>> 2. Overloading helper methods. MLlib used to have several overloaded
> helper methods for each algorithm, which later became a major maintenance
> burden. Builders and setters/getters are more maintainable. I will comment
> again on the PR.
> >>>>
> >>>> 3. The proposed API partitions graph into sub-graphs, as described in
> the property graph model. It is unclear to me how it would affect query
> performance because it requires SQL optimizer to correctly recognize data
> from the same source and make execution efficient.
> >>>>
> >>>> 4. The feature, although originally targeted for Spark 3.0, should
> not be a Spark 3.0 release blocker because it doesn't require breaking
> changes. If we miss the code freeze deadline, we can introduce a build flag
> to exclude the module from the official release/distribution, and then make
> it default once the module is ready.
> >>>>
> >>>> 5. If unfortunately we still don't see sufficient committer reviews,
> I think the best option would be submitting the work to Apache Incubator
> instead to unblock the work. But maybe it is too earlier to discuss this
> option.
> >>>>
> >>>> It would be great if other committers can offer help on the review!
> Really appreciated!
> >>>>
> >>>> Best,
> >>>> Xiangrui
> >>>>
> >>>> On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg 
> wrote:
> >>>>>
> >>>>> Hello dear Spark community
> >>>>>
> >>>>> We are the developers behind the SparkGraph SPIP, w

https://spark-project.atlassian.net/browse/SPARK-1153

2020-02-23 Thread kant kodali
Hi All,

Any chance of fixing this one,
https://spark-project.atlassian.net/browse/SPARK-1153, or offering some
workaround maybe?

Currently, I have a bunch of events streaming into Kafka across various topics,
and each event is stamped with a UUIDv1, so it is easy to construct edges using
UUIDs. I am not quite sure how to generate a long-based unique id without
synchronization in a distributed setting. I read this SO post
<https://stackoverflow.com/questions/15184820/how-to-generate-unique-positive-long-using-uuid>
which shows there are two ways one may be able to achieve this:

1.  UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE

2.  (System.currentTimeMillis() << 20) | (System.nanoTime() & ~
9223372036854251520L)

However, I am concerned about collisions and am looking for the probability of
collisions for the above two approaches. Any suggestions?
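
For reference, here is roughly what the two approaches above look like in
Java, plus Spark's built-in monotonically_increasing_id, which assigns unique
longs without any coordination (df below is an illustrative DataFrame of
vertices):

import java.util.UUID;
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// 1. Fold a random UUID into a non-negative long (only 63 usable bits, so the
//    collision probability follows the birthday bound over 2^63 values).
long id1 = UUID.randomUUID().getMostSignificantBits() & Long.MAX_VALUE;

// 2. Millisecond timestamp in the high bits, nanoTime noise in the low bits.
long id2 = (System.currentTimeMillis() << 20) | (System.nanoTime() & ~9223372036854251520L);

// Built-in alternative: unique per row with no synchronization (partition id in
// the upper bits, record number within the partition in the lower bits); the
// edges would still need a join to pick up the generated vertex ids.
Dataset<Row> verticesWithLongId = df.withColumn("long_id", monotonically_increasing_id());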

I ran the Connected Components algorithm using GraphFrames; it runs well when
long-based ids are used, but with strings the performance drops significantly,
as pointed out in the ticket. I understand that the algorithm depends heavily
on hashing integers, but I wonder why not use fixed-length byte[]? That way we
can convert any datatype to a sequence of bytes.

Thanks!


Re: SparkGraph review process

2020-02-13 Thread kant kodali
Any update on this? Is Spark Graph going to make it into Spark or not?

On Mon, Oct 14, 2019 at 12:26 PM Holden Karau  wrote:

> Maybe let’s ask the folks from Lightbend who helped with the previous
> scala upgrade for their thoughts?
>
> On Mon, Oct 14, 2019 at 8:24 PM Xiao Li  wrote:
>
>> 1. On the technical side, my main concern is the runtime dependency on
>>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>>> came out with the solution to shade a few Scala libraries to avoid
>>> pollution. However, I'm not super confident that the approach is
>>> sustainable for two reasons: a) there exists no proper shading libraries
>>> for Scala, 2) We will have to wait for upgrades from those Scala libraries
>>> before we can upgrade Spark to use a newer Scala version. So it would be
>>> great if some Scala experts can help review the current implementation and
>>> help assess the risk.
>>
>>
>> This concern is valid. I think we should start the vote to ensure the
>> whole community is aware of the risk and take the responsibility to
>> maintain this in the long term.
>>
>> Cheers,
>>
>> Xiao
>>
>>
>> Xiangrui Meng  于2019年10月4日周五 下午12:27写道:
>>
>>> Hi all,
>>>
>>> I want to clarify my role first to avoid misunderstanding. I'm an
>>> individual contributor here. My work on the graph SPIP as well as other
>>> Spark features I contributed to are not associated with my employer. It
>>> became quite challenging for me to keep track of the graph SPIP work due to
>>> less available time at home.
>>>
>>> On retrospective, we should have involved more Spark devs and committers
>>> early on so there is no single point of failure, i.e., me. Hopefully it is
>>> not too late to fix. I summarize my thoughts here to help onboard other
>>> reviewers:
>>>
>>> 1. On the technical side, my main concern is the runtime dependency on
>>> org.opencypher:okapi-shade. okapi depends on several Scala libraries. We
>>> came out with the solution to shade a few Scala libraries to avoid
>>> pollution. However, I'm not super confident that the approach is
>>> sustainable for two reasons: a) there exists no proper shading libraries
>>> for Scala, 2) We will have to wait for upgrades from those Scala libraries
>>> before we can upgrade Spark to use a newer Scala version. So it would be
>>> great if some Scala experts can help review the current implementation and
>>> help assess the risk.
>>>
>>> 2. Overloading helper methods. MLlib used to have several overloaded
>>> helper methods for each algorithm, which later became a major maintenance
>>> burden. Builders and setters/getters are more maintainable. I will comment
>>> again on the PR.
>>>
>>> 3. The proposed API partitions graph into sub-graphs, as described in
>>> the property graph model. It is unclear to me how it would affect query
>>> performance because it requires SQL optimizer to correctly recognize data
>>> from the same source and make execution efficient.
>>>
>>> 4. The feature, although originally targeted for Spark 3.0, should not
>>> be a Spark 3.0 release blocker because it doesn't require breaking changes.
>>> If we miss the code freeze deadline, we can introduce a build flag to
>>> exclude the module from the official release/distribution, and then make it
>>> default once the module is ready.
>>>
>>> 5. If unfortunately we still don't see sufficient committer reviews, I
>>> think the best option would be submitting the work to Apache Incubator
>>> instead to unblock the work. But maybe it is too earlier to discuss this
>>> option.
>>>
>>> It would be great if other committers can offer help on the review!
>>> Really appreciated!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Fri, Oct 4, 2019 at 1:32 AM Mats Rydberg 
>>> wrote:
>>>
 Hello dear Spark community

 We are the developers behind the SparkGraph SPIP, which is a project
 created out of our work on openCypher Morpheus (
 https://github.com/opencypher/morpheus). During this year we have
 collaborated with mainly Xiangrui Meng of Databricks to define and develop
 a new SparkGraph module based on our experience from working on Morpheus.
 Morpheus - formerly known as "Cypher for Apache Spark" - has been in
 development for over 3 years and matured in its API and implementation.

 The SPIP work has been on hold for a period of time now, as priorities
 at Databricks have changed which has occupied Xiangrui's time (as well as
 other happenings). As you may know, the latest API PR (
 https://github.com/apache/spark/pull/24851) is blocking us from moving
 forward with the implementation.

 In an attempt to not lose track of this project we now reach out to you
 to ask whether there are any Spark committers in the community who would be
 prepared to commit to helping us review and merge our code contributions to
 Apache Spark? We are not asking for lots of direct development support, as
 we believe we have the implementation more 

https://github.com/google/zetasql

2019-05-21 Thread kant kodali
https://github.com/google/zetasql


Re: queryable state & streaming

2019-03-16 Thread kant kodali
Any update on this?

On Wed, Oct 24, 2018 at 4:26 PM Arun Mahadevan  wrote:

> I don't think separate API or RPCs etc might be necessary for queryable
> state if the state can be exposed as just another datasource. Then the sql
> queries can be issued against it just like executing sql queries against
> any other data source.
>
> For now I think the "memory" sink could be used  as a sink and run queries
> against it but I agree it does not scale for large states.
>
> On Sun, 21 Oct 2018 at 21:24, Jungtaek Lim  wrote:
>
>> It doesn't seem Spark has workarounds other than storing output into
>> external storages, so +1 on having this.
>>
>> My major concern on implementing queryable state in structured streaming
>> is "Are all states available on executors at any time while query is
>> running?" Querying state shouldn't affect the running query. Given that
>> state is huge and default state provider is loading state in memory, we may
>> not want to load one more redundant snapshot of state: we want to always
>> load "current state" which query is also using. (For sure, Queryable state
>> should be read-only.)
>>
>> Regarding improvement of local state, I guess it is ideal to leverage
>> embedded db, like Kafka and Flink are doing. The difference will not be
>> only reading state from non-heap, but also how to take a snapshot and store
>> delta. We may want to check snapshotting works well with small batch
>> interval, and find alternative approach when it doesn't. Sounds like it is
>> a huge item and can be handled individually.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>> 2017년 12월 9일 (토) 오후 10:51, Stavros Kontopoulos 님이
>> 작성:
>>
>>> Nice I was looking for a jira. So I agree we should justify why we are
>>> building something. Now to that direction here is what I have seen from my
>>> experience.
>>> People quite often use state within their streaming app and may have
>>> large states (TBs). Shortening the pipeline by not having to copy data (to
>>> Cassandra for example for serving) is an advantage, in terms of at least
>>> latency and complexity.
>>> This can be true if we advantage of state checkpointing (locally could
>>> be RocksDB or in general HDFS the latter is currently supported)  along
>>> with an API to efficiently query data.
>>> Some use cases I see:
>>>
>>> - real-time dashboards and real-time reporting, the faster the better
>>> - monitoring of state for operational reasons, app health etc...
>>> - integrating with external services via an API eg. making accessible
>>>  aggregations over time windows to some third party service within your
>>> system
>>>
>>> Regarding requirements here are some of them:
>>> - support of an API to expose state (could be done at the spark driver),
>>> like rest.
>>> - supporting dynamic allocation (not sure how it affects state
>>> management)
>>> - an efficient way to talk to executors to get the state (rpc?)
>>> - making local state more efficient and easier accessible with an
>>> embedded db (I dont think this is supported from what I see, maybe wrong)?
>>> Some people are already working with such techs and some stuff could be
>>> re-used: https://issues.apache.org/jira/browse/SPARK-20641
>>>
>>> Best,
>>> Stavros
>>>
>>>
>>> On Fri, Dec 8, 2017 at 10:32 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 https://issues.apache.org/jira/browse/SPARK-16738

 I don't believe anyone is working on it yet.  I think the most useful
 thing is to start enumerating requirements and use cases and then we can
 talk about how to build it.

 On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos <
 st.kontopou...@gmail.com> wrote:

> Cool Burak do you have a pointer, should I take the initiative for a
> first design document or Databricks is working on it?
>
> Best,
> Stavros
>
> On Fri, Dec 8, 2017 at 8:40 PM, Burak Yavuz  wrote:
>
>> Hi Stavros,
>>
>> Queryable state is definitely on the roadmap! We will revamp the
>> StateStore API a bit, and a queryable StateStore is definitely one of the
>> things we are thinking about during that revamp.
>>
>> Best,
>> Burak
>>
>> On Dec 8, 2017 9:57 AM, "Stavros Kontopoulos" <
>> st.kontopou...@gmail.com> wrote:
>>
>>> Just to re-phrase my question: Would query-able state make a viable
>>> SPIP?
>>>
>>> Regards,
>>> Stavros
>>>
>>> On Thu, Dec 7, 2017 at 1:34 PM, Stavros Kontopoulos <
>>> st.kontopou...@gmail.com> wrote:
>>>
 Hi,

 Maybe this has been discussed before. Given the fact that many
 streaming apps out there use state extensively, could be a good idea to
 make Spark expose streaming state with an external API like other
 systems do (Kafka streams, Flink etc), in order to facilitate
 interactive queries?

 Regards,
 Stavros

>>>
>>>
>

>>>
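
As a footnote on the "memory" sink workaround Arun mentions above, a minimal
sketch in Java (aggDf, spark, and the query name are illustrative; as noted in
the thread, this only scales to small state):

// Register the running aggregation as an in-memory table...
StreamingQuery query = aggDf.writeStream()
        .outputMode("complete")
        .format("memory")
        .queryName("agg_state")
        .start();

// ...then the "state" can be queried like any other table while the stream runs.
Dataset<Row> snapshot = spark.sql("SELECT * FROM agg_state");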


Re: Plan on Structured Streaming in next major/minor release?

2018-11-02 Thread kant kodali
If I can add one thing to this list, I would say stateless aggregations using
raw SQL.

For example: as I read micro-batches from Kafka, I want to do, say, a count of
that micro-batch and emit it using raw SQL (no count aggregation across
batches.)
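
A rough sketch of what I mean, using the foreachBatch sink (assuming Spark
2.4+ and an existing streaming Dataset<Row> named kafkaDf; names are
illustrative):

import org.apache.spark.api.java.function.VoidFunction2;

StreamingQuery query = kafkaDf.writeStream()
        .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
            @Override
            public void call(Dataset<Row> batchDf, Long batchId) {
                // each micro-batch is just a regular DataFrame here, so the
                // count is per batch and no state is carried across batches
                batchDf.createOrReplaceTempView("current_batch");
                batchDf.sparkSession()
                       .sql("SELECT count(*) AS cnt FROM current_batch")
                       .show();
            }
        })
        .start();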



On Tue, Oct 30, 2018 at 4:55 PM Jungtaek Lim  wrote:

> OK thanks for clarifying. I guess it is one of major features in streaming
> area and nice to add, but also agree it would require huge investigation.
>
> 2018년 10월 31일 (수) 오전 8:06, Michael Armbrust 님이 작성:
>
>> Agree. Just curious, could you explain what do you mean by "negation"?
>>> Does it mean applying retraction on aggregated?
>>>
>>
>> Yeah exactly.  Our current streaming aggregation assumes that the input
>> is in append-mode and multiple aggregations break this.
>>
>


Re: Plan on Structured Streaming in next major/minor release?

2018-10-20 Thread kant kodali
+1 for raising all this.
+1 for Queryable State (SPARK-16738 [3]).

On Thu, Oct 18, 2018 at 9:59 PM Jungtaek Lim  wrote:

> Small correction: "timeout" in map/flatmapGroupsWithState would not work
> similar as State TTL when event time and watermark is set. So timeout in
> map/flatmapGroupsWithState is to guarantee removal of state when the state
> will not be used, as similar as what we do with streaming aggregation,
> whereas State TTL is just work as its name is represented
> (self-explanatory). Hence State TTL looks valid for all the cases.
>
> 2018년 10월 19일 (금) 오후 12:20, Jungtaek Lim 님이 작성:
>
>> Hi devs,
>>
>> While Spark 2.4.0 is still in progress of release votes, I'm seeing some
>> pull requests on non-SS are being reviewed and merged into master branch,
>> so I guess discussion about next release is OK.
>>
>> Looks like there's a major TODO left on structured streaming: allowing
>> stateful operation in continuous mode (watermark, stateful exactly-once)
>> and no other major milestone is shared. (Please let me know if I'm missing
>> here!) As a structured streaming contributor's point of view, there're
>> another features we could discuss and see which are good to have, and
>> prioritize if possible (NOTE: just a brainstorming and some items might not
>> be valid for structured streaming):
>>
>> * Native support on session window (SPARK-10816 [1])
>>   ** patch available
>> * Support delegation token on Kafka (SPARK-25501 [2])
>>   ** patch available
>> * Queryable State (SPARK-16738 [3])
>>   ** some discussion took place, but no action is taken yet
>> * End to end exactly-once with Kafka sink
>>   ** given Kafka is the first class on streaming source/sink nowadays
>> * Custom window / custom watermark
>> * Physically scale (up/down) streaming state
>> * State TTL (especially for non-watermark state)
>>   ** "timeout" in map/flatmapGroupsWithState fits it, but just to check
>> whether we want to have it for normal streaming aggregation
>> * Provide discarded events due to late via side output or similar feature
>>   ** for me it looks like tricky one, since Spark's RDD as well as SQL
>> semantic provide one output
>> * more?
>>
>> Would like to hear others opinions about this. Please also share if
>> there're ongoing efforts on other items for structured streaming. Happy to
>> help out if it needs another hand.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-10816
>> 2. https://issues.apache.org/jira/browse/SPARK-25501
>> 3. https://issues.apache.org/jira/browse/SPARK-16738
>>
>>


Re: Feature request: Java-specific transform method in Dataset

2018-07-01 Thread kant kodali
I am not affiliated with Flink or Spark, but I do think some of the thoughts
here make sense.

On Sun, Jul 1, 2018 at 4:12 PM, Sean Owen  wrote:

> It's true, that is one of the issues to be solved by the 2.12-compatible
> build, because it otherwise introduces an overload ambiguity for Java 8
> lambdas. But for that reason I think the current transform() method would
> start working with lambdas. That would only help 2.12 builds; maybe that's
> an OK solution?
>
>
> On Sun, Jul 1, 2018, 2:36 PM Reynold Xin  wrote:
>
>> This wouldn’t be a problem with Scala 2.12 right?
>>
>> On Sun, Jul 1, 2018 at 12:23 PM Sean Owen  wrote:
>>
>>> I see, transform() doesn't have the same overload that other methods do
>>> in order to support Java 8 lambdas as you'd expect. One option is to
>>> introduce something like MapFunction for transform and introduce an
>>> overload.
>>>
>>> I think transform() isn't used much at all, so maybe why it wasn't
>>> Java-fied. Before Java 8 it wouldn't have made much sense in Java. Now it
>>> might. I think it could be OK to add the overload to match how map works.
>>>
>>> On Sun, Jul 1, 2018 at 1:33 PM Ismael Carnales 
>>> wrote:
>>>
 No, because Function1 from Scala is not a functional interface.
 You can see a simple example of what I'm trying to accomplish In the
 unit test here:
 https://github.com/void/spark/blob/java-transform/sql/core/
 src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java#L73


 On Sun, Jul 1, 2018 at 2:48 PM Sean Owen  wrote:

> Don't Java 8 lambdas let you do this pretty immediately? Can you give
> an example here of what you want to do and how you are trying to do it?
>
> On Sun, Jul 1, 2018, 12:42 PM Ismael Carnales 
> wrote:
>
>> Hi,
>>  it would be nice to have an easier way to use the Dataset transform
>> method from Java than implementing a Function1 from Scala.
>>
>> I've made a simple implentation here:
>>
>> https://github.com/void/spark/tree/java-transform
>>
>> Should I open a JIRA?
>>
>> Ismael Carnales
>>
>
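
In the meantime, a workaround sketch for calling transform from Java by
implementing scala.Function1 directly (the filter inside is just illustrative):

import scala.runtime.AbstractFunction1;

// transform() composes the Datasets on the driver, so the function does not
// need to be serializable.
Dataset<Row> result = df.transform(
        new AbstractFunction1<Dataset<Row>, Dataset<Row>>() {
            @Override
            public Dataset<Row> apply(Dataset<Row> in) {
                return in.filter("value IS NOT NULL");
            }
        });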


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread kant kodali
I am not sure how SPARK-23406
<https://issues.apache.org/jira/browse/SPARK-23406> is a new feature, since
streaming joins are already part of Spark 2.3.0. Self joins didn't work
because of a bug and it has been fixed, but I can understand if it touches
some other code paths.

On Wed, May 16, 2018 at 3:22 AM, Marco Gaido <marcogaid...@gmail.com> wrote:

> I'd be against having a new feature in a minor maintenance release. I
> think such a release should contain only bugfixes.
>
> 2018-05-16 12:11 GMT+02:00 kant kodali <kanth...@gmail.com>:
>
>> Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of
>> 2.3.1?
>>
>> On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> Bummer. People should still feel welcome to test the existing RC so we
>>> can rule out other issues.
>>>
>>> On Tue, May 15, 2018 at 2:04 PM, Xiao Li <gatorsm...@gmail.com> wrote:
>>> > -1
>>> >
>>> > We have a correctness bug fix that was merged after 2.3 RC1. It would
>>> be
>>> > nice to have that in Spark 2.3.1 release.
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-24259
>>> >
>>> > Xiao
>>> >
>>> >
>>> > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin <van...@cloudera.com>:
>>> >>
>>> >> Please vote on releasing the following candidate as Apache Spark
>>> version
>>> >> 2.3.1.
>>> >>
>>> >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
>>> >> a majority of at least 3 +1 PMC votes are cast.
>>> >>
>>> >> [ ] +1 Release this package as Apache Spark 2.3.1
>>> >> [ ] -1 Do not release this package because ...
>>> >>
>>> >> To learn more about Apache Spark, please see http://spark.apache.org/
>>> >>
>>> >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
>>> >> https://github.com/apache/spark/tree/v2.3.0-rc1
>>> >>
>>> >> The release files, including signatures, digests, etc. can be found
>>> at:
>>> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
>>> >>
>>> >> Signatures used for Spark RCs can be found in this file:
>>> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >>
>>> >> The staging repository for this release can be found at:
>>> >> https://repository.apache.org/content/repositories/orgapache
>>> spark-1269/
>>> >>
>>> >> The documentation corresponding to this release can be found at:
>>> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
>>> >>
>>> >> The list of bug fixes going into 2.3.1 can be found at the following
>>> URL:
>>> >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>>> >>
>>> >> FAQ
>>> >>
>>> >> =
>>> >> How can I help test this release?
>>> >> =
>>> >>
>>> >> If you are a Spark user, you can help us test this release by taking
>>> >> an existing Spark workload and running on this release candidate, then
>>> >> reporting any regressions.
>>> >>
>>> >> If you're working in PySpark you can set up a virtual env and install
>>> >> the current RC and see if anything important breaks, in the Java/Scala
>>> >> you can add the staging repository to your projects resolvers and test
>>> >> with the RC (make sure to clean up the artifact cache before/after so
>>> >> you don't end up building with a out of date RC going forward).
>>> >>
>>> >> ===
>>> >> What should happen to JIRA tickets still targeting 2.3.1?
>>> >> ===
>>> >>
>>> >> The current list of open tickets targeted at 2.3.1 can be found at:
>>> >> https://s.apache.org/Q3Uo
>>> >>
>>> >> Committers should look at those and triage. Extremely important bug
>>> >> fixes, documentation, and API tweaks that impact compatibility should
>>> >> be worked on immediately. Everything else please retarget to an
>>> >> appropriate release.
>>> >>
>>> >> ==
>>> >> But my bug isn't fixed?
>>> >> ==
>>> >>
>>> >> In order to make timely releases, we will typically not hold the
>>> >> release unless the bug in question is a regression from the previous
>>> >> release. That being said, if there is something which is a regression
>>> >> that has not been correctly targeted please ping me or a committer to
>>> >> help target the issue.
>>> >>
>>> >>
>>> >> --
>>> >> Marcelo
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: [VOTE] Spark 2.3.1 (RC1)

2018-05-16 Thread kant kodali
Can this https://issues.apache.org/jira/browse/SPARK-23406 be part of 2.3.1?

On Tue, May 15, 2018 at 2:07 PM, Marcelo Vanzin  wrote:

> Bummer. People should still feel welcome to test the existing RC so we
> can rule out other issues.
>
> On Tue, May 15, 2018 at 2:04 PM, Xiao Li  wrote:
> > -1
> >
> > We have a correctness bug fix that was merged after 2.3 RC1. It would be
> > nice to have that in Spark 2.3.1 release.
> >
> > https://issues.apache.org/jira/browse/SPARK-24259
> >
> > Xiao
> >
> >
> > 2018-05-15 14:00 GMT-07:00 Marcelo Vanzin :
> >>
> >> Please vote on releasing the following candidate as Apache Spark version
> >> 2.3.1.
> >>
> >> The vote is open until Friday, May 18, at 21:00 UTC and passes if
> >> a majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 2.3.1
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see http://spark.apache.org/
> >>
> >> The tag to be voted on is v2.3.1-rc1 (commit cc93bc95):
> >> https://github.com/apache/spark/tree/v2.3.0-rc1
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-bin/
> >>
> >> Signatures used for Spark RCs can be found in this file:
> >> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1269/
> >>
> >> The documentation corresponding to this release can be found at:
> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc1-docs/
> >>
> >> The list of bug fixes going into 2.3.1 can be found at the following
> URL:
> >> https://issues.apache.org/jira/projects/SPARK/versions/12342432
> >>
> >> FAQ
> >>
> >> =
> >> How can I help test this release?
> >> =
> >>
> >> If you are a Spark user, you can help us test this release by taking
> >> an existing Spark workload and running on this release candidate, then
> >> reporting any regressions.
> >>
> >> If you're working in PySpark you can set up a virtual env and install
> >> the current RC and see if anything important breaks, in the Java/Scala
> >> you can add the staging repository to your projects resolvers and test
> >> with the RC (make sure to clean up the artifact cache before/after so
> >> you don't end up building with a out of date RC going forward).
> >>
> >> ===
> >> What should happen to JIRA tickets still targeting 2.3.1?
> >> ===
> >>
> >> The current list of open tickets targeted at 2.3.1 can be found at:
> >> https://s.apache.org/Q3Uo
> >>
> >> Committers should look at those and triage. Extremely important bug
> >> fixes, documentation, and API tweaks that impact compatibility should
> >> be worked on immediately. Everything else please retarget to an
> >> appropriate release.
> >>
> >> ==
> >> But my bug isn't fixed?
> >> ==
> >>
> >> In order to make timely releases, we will typically not hold the
> >> release unless the bug in question is a regression from the previous
> >> release. That being said, if there is something which is a regression
> >> that has not been correctly targeted please ping me or a committer to
> >> help target the issue.
> >>
> >>
> >> --
> >> Marcelo
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread kant kodali
Hi All,

+1 for the tickets proposed by Ryan Blue

Any chance of this one
https://issues.apache.org/jira/browse/SPARK-23406 getting into 2.3.0? It's
a very important feature for us, so if it doesn't make the cut I would have
to cherry-pick this commit and compile from source for our production
release.

Thanks!

On Wed, Feb 21, 2018 at 9:01 AM, Ryan Blue 
wrote:

> What does everyone think about getting some of the newer DataSourceV2
> improvements in? It should be low risk because it is a new code path, and
> v2 isn't very usable without things like support for using the output
> commit coordinator to deconflict writes.
>
> The ones I'd like to get in are:
> * Use the output commit coordinator: https://issues.
> apache.org/jira/browse/SPARK-23323
> * Use immutable trees and the same push-down logic as other read paths:
> https://issues.apache.org/jira/browse/SPARK-23203
> * Don't allow users to supply schemas when they aren't supported:
> https://issues.apache.org/jira/browse/SPARK-23418
>
> I think it would make the 2.3.0 release more usable for anyone interested
> in the v2 read and write paths.
>
> Thanks!
>
> On Tue, Feb 20, 2018 at 7:07 PM, Weichen Xu 
> wrote:
>
>> +1
>>
>> On Wed, Feb 21, 2018 at 10:07 AM, Marcelo Vanzin 
>> wrote:
>>
>>> Done, thanks!
>>>
>>> On Tue, Feb 20, 2018 at 6:05 PM, Sameer Agarwal 
>>> wrote:
>>> > Sure, please feel free to backport.
>>> >
>>> > On 20 February 2018 at 18:02, Marcelo Vanzin 
>>> wrote:
>>> >>
>>> >> Hey Sameer,
>>> >>
>>> >> Mind including https://github.com/apache/spark/pull/20643
>>> >> (SPARK-23468)  in the new RC? It's a minor bug since I've only hit it
>>> >> with older shuffle services, but it's pretty safe.
>>> >>
>>> >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal 
>>> >> wrote:
>>> >> > This RC has failed due to
>>> >> > https://issues.apache.org/jira/browse/SPARK-23470.
>>> >> > Now that the fix has been merged in 2.3 (thanks Marcelo!), I'll
>>> follow
>>> >> > up
>>> >> > with an RC5 soon.
>>> >> >
>>> >> > On 20 February 2018 at 16:49, Ryan Blue  wrote:
>>> >> >>
>>> >> >> +1
>>> >> >>
>>> >> >> Build & tests look fine, checked signature and checksums for src
>>> >> >> tarball.
>>> >> >>
>>> >> >> On Tue, Feb 20, 2018 at 12:54 PM, Shixiong(Ryan) Zhu
>>> >> >>  wrote:
>>> >> >>>
>>> >> >>> I'm -1 because of the UI regression
>>> >> >>> https://issues.apache.org/jira/browse/SPARK-23470 : the All Jobs
>>> page
>>> >> >>> may be
>>> >> >>> too slow and cause "read timeout" when there are lots of jobs and
>>> >> >>> stages.
>>> >> >>> This is one of the most important pages because when it's broken,
>>> it's
>>> >> >>> pretty hard to use Spark Web UI.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Tue, Feb 20, 2018 at 4:37 AM, Marco Gaido <
>>> marcogaid...@gmail.com>
>>> >> >>> wrote:
>>> >> 
>>> >>  +1
>>> >> 
>>> >>  2018-02-20 12:30 GMT+01:00 Hyukjin Kwon :
>>> >> >
>>> >> > +1 too
>>> >> >
>>> >> > 2018-02-20 14:41 GMT+09:00 Takuya UESHIN <
>>> ues...@happy-camper.st>:
>>> >> >>
>>> >> >> +1
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang
>>> >> >> 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> +1
>>> >> >>>
>>> >> >>>
>>> >> >>> Wenchen Fan 于2018年2月20日 周二下午1:09写道:
>>> >> 
>>> >>  +1
>>> >> 
>>> >>  On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin
>>> >>  
>>> >>  wrote:
>>> >> >
>>> >> > +1
>>> >> >
>>> >> > On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal
>>> >> > , wrote:
>>> >> >>
>>> >> >> this file shouldn't be included?
>>> >> >>
>>> >> >> https://dist.apache.org/repos/
>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>> >> >
>>> >> >
>>> >> > I've now deleted this file
>>> >> >
>>> >> >> From: Sameer Agarwal 
>>> >> >> Sent: Saturday, February 17, 2018 1:43:39 PM
>>> >> >> To: Sameer Agarwal
>>> >> >> Cc: dev
>>> >> >> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
>>> >> >>
>>> >> >> I'll start with a +1 once again.
>>> >> >>
>>> >> >> All blockers reported against RC3 have been resolved and
>>> the
>>> >> >> builds are healthy.
>>> >> >>
>>> >> >> On 17 February 2018 at 13:41, Sameer Agarwal
>>> >> >> 
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Please vote on releasing the following candidate as Apache
>>> >> >>> Spark
>>> >> >>> version 2.3.0. The vote is open until Thursday February
>>> 22,
>>> >> >>> 2018 at 

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread kant kodali
+1

On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin  wrote:

> Awesome! Congrats! Can't wait!!
>
> jg
>
>
> On Jul 11, 2017, at 18:48, Michael Armbrust 
> wrote:
>
> Hi all,
>
> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This
> release removes the experimental tag from Structured Streaming. In
> addition, this release focuses on usability, stability, and polish,
> resolving over 1100 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 2.2.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes: https://spark.apache.or
> g/releases/spark-release-2-2-0.html
>
> *(note: If you see any issues with the release notes, webpage or published
> artifacts, please contact me directly off-list) *
>
> Michael
>
>


Re: Question on Spark code

2017-06-25 Thread kant kodali
Impressive! I need to learn more about Scala.

What I mean by stripping away the conditional check in Java is this:

static final boolean isLogInfoEnabled = false;

public void logMessage(String message) {
    // dead branch: javac can eliminate it because the flag is a compile-time constant
    if (isLogInfoEnabled) {
        log.info(message);
    }
}

If you look at the bytecode, the dead if-check will be removed.
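
The closest Java analogue I can think of to the "msg: => String" by-name trick
(this is just a sketch, not Spark's actual API; it assumes an slf4j logger
named log, and recordCount is an illustrative variable) is passing a Supplier
so the message is only built when the level is enabled:

import java.util.function.Supplier;

public void logMessage(Supplier<String> message) {
    if (log.isInfoEnabled()) {
        // message.get() runs only here, so an expensive concatenation in the
        // caller's lambda is skipped entirely when INFO is disabled
        log.info(message.get());
    }
}

// caller: the string is never built unless INFO logging is on
logMessage(() -> "processed " + recordCount + " records");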

On Sun, Jun 25, 2017 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:

> I think it's more precise to say args like any expression are evaluated
> when their value is required. It's just that this special syntax causes
> extra code to be generated that makes it effectively a function passed, not
> value, and one that's lazily evaluated. Look at the bytecode if you're
> curious.
>
> An if conditional is pretty trivial to evaluate here. I don't think that
> guidance is sound. The point is that it's not worth complicating the caller
> code in almost all cases by checking the guard condition manually.
>
> I'm not sure what you're referring to, but no, no compiler can elide these
> conditions. They're based on runtime values that can change at runtime.
>
> scala has an @elidable annotation which you can use to indicate to the
> compiler that the declaration should be entirely omitted when configured to
> elide above a certain detail level. This is how scalac elides assertions if
> you ask it to and you can do it to your own code. But this is something
> different, not what's happening here, and a fairly niche/advanced feature.
>
>
> On Sun, Jun 25, 2017 at 8:25 PM kant kodali <kanth...@gmail.com> wrote:
>
>> @Sean Got it! I come from Java world so I guess I was wrong in assuming
>> that arguments are evaluated during the method invocation time. How about
>> the conditional checks to see if the log is InfoEnabled or DebugEnabled?
>> For Example,
>>
>> if (log.isInfoEnabled) log.info(msg)
>>
>> I hear we should use guard condition only when the string "msg"
>> construction is expensive otherwise we will be taking a performance hit
>> because of the additional "if" check unless the "log" itself is declared
>> static final and scala compiler will strip away the "if" check and produce
>> efficient byte code. Also log.info does the log.isInfoEnabled check
>> inside the body prior to logging the messages.
>>
>> https://github.com/qos-ch/slf4j/blob/master/slf4j-
>> simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L509
>> https://github.com/qos-ch/slf4j/blob/master/slf4j-
>> simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L599
>>
>> Please correct me if I am wrong.
>>
>>
>>
>>
>> On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Maybe you are looking for declarations like this. "=> String" means the
>>> arg isn't evaluated until it's used, which is just what you want with log
>>> statements. The message isn't constructed unless it will be logged.
>>>
>>> protected def logInfo(msg: => String) {
>>>
>>>
>>> On Sun, Jun 25, 2017 at 10:28 AM kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I came across this file https://github.com/
>>>> apache/spark/blob/master/core/src/main/scala/org/apache/
>>>> spark/internal/Logging.scala and I am wondering what is the purpose of
>>>> this? Especially it doesn't prevent any string concatenation and also the
>>>> if checks are already done by the library itself right?
>>>>
>>>> Thanks!
>>>>
>>>>
>>


Re: Question on Spark code

2017-06-25 Thread kant kodali
@Sean Got it! I come from the Java world, so I guess I was wrong in assuming
that arguments are evaluated at method invocation time. How about the
conditional checks to see if the log is info-enabled or debug-enabled?
For example:

if (log.isInfoEnabled) log.info(msg)

I hear we should use a guard condition only when the construction of the string
"msg" is expensive; otherwise we will be taking a performance hit because of
the additional "if" check, unless the "log" itself is declared static final and
the Scala compiler strips away the "if" check and produces efficient bytecode.
Also, log.info does the log.isInfoEnabled check inside its body prior to
logging the message.

https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L509
https://github.com/qos-ch/slf4j/blob/master/slf4j-simple/src/main/java/org/slf4j/simple/SimpleLogger.java#L599

Please correct me if I am wrong.




On Sun, Jun 25, 2017 at 3:04 AM, Sean Owen <so...@cloudera.com> wrote:

> Maybe you are looking for declarations like this. "=> String" means the
> arg isn't evaluated until it's used, which is just what you want with log
> statements. The message isn't constructed unless it will be logged.
>
> protected def logInfo(msg: => String) {
>
>
> On Sun, Jun 25, 2017 at 10:28 AM kant kodali <kanth...@gmail.com> wrote:
>
>> Hi All,
>>
>> I came across this file https://github.com/apache/spark/blob/master/core/
>> src/main/scala/org/apache/spark/internal/Logging.scala and I am
>> wondering what is the purpose of this? Especially it doesn't prevent any
>> string concatenation and also the if checks are already done by the library
>> itself right?
>>
>> Thanks!
>>
>>


Question on Spark code

2017-06-25 Thread kant kodali
Hi All,

I came across this file
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala
and I am wondering what the purpose of it is. In particular, it doesn't prevent
any string concatenation, and the if checks are already done by the library
itself, right?

Thanks!


Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread kant kodali
Even if I do a simple count aggregation like the one below, I get the same
error as https://issues.apache.org/jira/browse/SPARK-19268

Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"),
        "24 hours", "24 hours"), df1.col("AppName")).count();


On Wed, May 24, 2017 at 3:35 PM, kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I am using Spark 2.1.1 and running in a Standalone mode using HDFS and
> Kafka
>
> I am running into the same problem as https://issues.apache.org/
> jira/browse/SPARK-19268 with my app(not KafkaWordCount).
>
> Here is my sample code
>
> *Here is how I create ReadStream*
>
> sparkSession.readStream()
> .format("kafka")
> .option("kafka.bootstrap.servers", 
> config.getString("kafka.consumer.settings.bootstrapServers"))
> .option("subscribe", 
> config.getString("kafka.consumer.settings.topicName"))
> .option("startingOffsets", "earliest")
> .option("failOnDataLoss", "false")
> .option("checkpointLocation", hdfsCheckPointDir)
> .load();
>
>
> *The core logic*
>
> Dataset df = ds.select(from_json(new Column("value").cast("string"), 
> client.getSchema()).as("payload"));
> Dataset df1 = df.selectExpr("payload.info.*", "payload.data.*");
> Dataset df2 = df1.groupBy(window(df1.col("Timestamp5"), "24 hours", "24 
> hours"), df1.col("AppName")).agg(sum("Amount"));
> StreamingQuery query = df1.writeStream().foreach(new 
> KafkaSink()).outputMode("update").start();
> query.awaitTermination();
>
>
> I can also provide any other information you may need.
>
> Thanks!
>


Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread kant kodali
Hi All,

I am using Spark 2.1.1, running in standalone mode, with HDFS and Kafka.

I am running into the same problem as
https://issues.apache.org/jira/browse/SPARK-19268 with my app (not
KafkaWordCount).

Here is my sample code

*Here is how I create ReadStream*

Dataset<Row> ds = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers",
                config.getString("kafka.consumer.settings.bootstrapServers"))
        .option("subscribe",
                config.getString("kafka.consumer.settings.topicName"))
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", "false")
        .option("checkpointLocation", hdfsCheckPointDir)
        .load();


*The core logic*

Dataset<Row> df = ds.select(from_json(new Column("value").cast("string"),
        client.getSchema()).as("payload"));
Dataset<Row> df1 = df.selectExpr("payload.info.*", "payload.data.*");
Dataset<Row> df2 = df1.groupBy(window(df1.col("Timestamp5"), "24 hours",
        "24 hours"), df1.col("AppName")).agg(sum("Amount"));
StreamingQuery query = df1.writeStream().foreach(new KafkaSink())
        .outputMode("update").start();
query.awaitTermination();


I can also provide any other information you may need.

Thanks!


Spark 2.2.0 or Spark 2.3.0?

2017-05-01 Thread kant kodali
Hi All,

If I understand the standard Spark release process correctly, it looks like
the official release is going to be sometime at the end of this month, and it
is going to be 2.2.0, right (not 2.3.0)? I am eagerly waiting for Spark 2.2.0
because of the "update mode" option in Spark Streaming. Please correct me if
I am wrong.

Thanks!


Re: is there a way to persist the lineages generated by spark?

2017-04-07 Thread kant kodali
Yes, lineage that is actually replayable is what is needed for the validation
process, so we can address questions like how a system arrived at a state S at
a time T. I guess a good analogy is event sourcing.


On Thu, Apr 6, 2017 at 10:30 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> I do think this is the right way, you will have to do testing with test
> data verifying that the expected output of the calculation is the output.
> Even if the logical Plan Is correct your calculation might not be. E.g.
> There can be bugs in Spark, in the UI or (what is very often) the client
> describes a calculation, but in the end the description is wrong.
>
> > On 4. Apr 2017, at 05:19, kant kodali <kanth...@gmail.com> wrote:
> >
> > Hi All,
> >
> > I am wondering if there a way to persist the lineages generated by spark
> underneath? Some of our clients want us to prove if the result of the
> computation that we are showing on a dashboard is correct and for that If
> we can show the lineage of transformations that are executed to get to the
> result then that can be the Q.E.D moment but I am not even sure if this is
> even possible with spark?
> >
> > Thanks,
> > kant
>


is there a way to persist the lineages generated by spark?

2017-04-03 Thread kant kodali
Hi All,

I am wondering if there is a way to persist the lineages generated by Spark
underneath. Some of our clients want us to prove that the result of the
computation we are showing on a dashboard is correct, and for that, if we can
show the lineage of transformations that were executed to get to the result,
then that can be the Q.E.D. moment. But I am not even sure whether this is
possible with Spark.
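
The closest built-in hooks I can see are the textual plans and RDD debug
strings Spark already exposes; persisting those gives an audit trail of the
transformations, though not a replayable lineage. A rough sketch in Java (the
query and output path are illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

Dataset<Row> result = spark.sql("SELECT region, sum(amount) FROM sales GROUP BY region");

// parsed / analyzed / optimized logical plans plus the physical plan
String plans = result.queryExecution().toString();

// RDD-level lineage: the DAG of parent RDDs and their transformations
String rddLineage = result.rdd().toDebugString();

Files.write(Paths.get("/audit/dashboard_query_lineage.txt"),
        (plans + "\n" + rddLineage).getBytes(StandardCharsets.UTF_8));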

Thanks,
kant


Are we still dependent on Guava jar in Spark 2.1.0 as well?

2017-02-26 Thread kant kodali
Are we still dependent on the Guava jar in Spark 2.1.0 as well (given the
Guava jar incompatibility issues)?


Re: Java 9

2017-02-07 Thread kant kodali
Well, and the module system!

On Tue, Feb 7, 2017 at 4:03 AM, Timur Shenkao  wrote:

> If I'm not wrong, they got fid of   *sun.misc.Unsafe   *in Java 9.
>
> This class is till used by several libraries & frameworks.
>
> http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
>
> On Tue, Feb 7, 2017 at 12:51 PM, Pete Robbins  wrote:
>
>> Yes, I agree but it may be worthwhile starting to look at this. I was
>> just trying a build and it trips over some of the now defunct/inaccessible
>> sun.misc classes.
>>
>> I was just interested in hearing if anyone has already gone through this
>> to save me duplicating effort.
>>
>> Cheers,
>>
>> On Tue, 7 Feb 2017 at 11:46 Sean Owen  wrote:
>>
>>> I don't think anyone's tried it. I think we'd first have to agree to
>>> drop Java 7 support before that could be seriously considered. The 8-9
>>> difference is a bit more of a breaking change.
>>>
>>> On Tue, Feb 7, 2017 at 11:44 AM Pete Robbins 
>>> wrote:
>>>
>>> Is anyone working on support for running Spark on Java 9? Is this in a
>>> roadmap anywhere?
>>>
>>>
>>> Cheers,
>>>
>>>
>


Re: Wrting data from Spark streaming to AWS Redshift?

2016-12-11 Thread kant kodali
@shyla, a side question: what can Redshift do that Spark cannot? Trying to
understand your use case.

On Fri, Dec 9, 2016 at 8:47 PM, ayan guha  wrote:

> Ideally, saving data to external sources should not be any different. give
> the write options as stated in the bloga shot, but changing mode to append.
>
> On Sat, Dec 10, 2016 at 8:25 AM, shyla deshpande  > wrote:
>
>> Hello all,
>>
>> Is it possible to Write data from Spark streaming to AWS Redshift?
>>
>> I came across the following article, so looks like it works from a Spark
>> batch program.
>>
>> https://databricks.com/blog/2015/10/19/introducing-redshift-
>> data-source-for-spark.html
>>
>> I want to write to AWS Redshift from Spark Stream. Please share your
>> experience and reference docs.
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-17 Thread kant kodali
Can we have a JSONType for Spark SQL?

On Wed, Nov 16, 2016 at 8:41 PM, Nathan Lande <nathanla...@gmail.com> wrote:

> If you are dealing with a bunch of different schemas in 1 field, figuring
> out a strategy to deal with that will depend on your data and does not
> really have anything to do with spark since mapping your JSON payloads to
> tractable data structures will depend on business logic.
>
> The strategy of pulling out a blob into its on rdd and feeding it into the
> JSON loader should work for any data source once you have your data
> strategy figured out.
>
> On Wed, Nov 16, 2016 at 4:39 PM, kant kodali <kanth...@gmail.com> wrote:
>
>> 1. I have a Cassandra Table where one of the columns is blob. And this
>> blob contains a JSON encoded String however not all the blob's across the
>> Cassandra table for that column are same (some blobs have difference json's
>> than others) so In that case what is the best way to approach it? Do we
>> need to put /group all the JSON Blobs that have same structure (same keys)
>> into each individual data frame? For example, say if I have 5 json blobs
>> that have same structure and another 3 JSON blobs that belongs to some
>> other structure In this case do I need to create two data frames? (Attached
>> is a screen shot of 2 rows of how my json looks like)
>> 2. In my case, toJSON on RDD doesn't seem to help a lot. Attached a
>> screen shot. Looks like I got the same data frame as my original one.
>>
>> Thanks much for these examples.
>>
>>
>>
>> On Wed, Nov 16, 2016 at 2:54 PM, Nathan Lande <nathanla...@gmail.com>
>> wrote:
>>
>>> I'm looking forward to 2.1 but, in the meantime, you can pull out the
>>> specific column into an RDD of JSON objects, pass this RDD into the
>>> read.json() and then join the results back onto your initial DF.
>>>
>>> Here is an example of what we do to unpack headers from Avro log data:
>>>
>>> def jsonLoad(path):
>>> #
>>> #load in the df
>>> raw = (sqlContext.read
>>> .format('com.databricks.spark.avro')
>>> .load(path)
>>> )
>>> #
>>> #define json blob, add primary key elements (hi and lo)
>>> #
>>> JSONBlob = concat(
>>> lit('{'),
>>> concat(lit('"lo":'), col('header.eventId.lo').cast('string'),
>>> lit(',')),
>>> concat(lit('"hi":'), col('header.eventId.hi').cast('string'),
>>> lit(',')),
>>> concat(lit('"response":'), decode('requestResponse.response',
>>> 'UTF-8')),
>>> lit('}')
>>> )
>>> #
>>> #extract the JSON blob as a string
>>> rawJSONString = raw.select(JSONBlob).rdd.map(lambda x: str(x[0]))
>>> #
>>> #transform the JSON string into a DF struct object
>>> structuredJSON = sqlContext.read.json(rawJSONString)
>>> #
>>> #join the structured JSON back onto the initial DF using the hi and
>>> lo join keys
>>> final = (raw.join(structuredJSON,
>>> ((raw['header.eventId.lo'] == structuredJSON['lo']) &
>>> (raw['header.eventId.hi'] == structuredJSON['hi'])),
>>> 'left_outer')
>>> .drop('hi')
>>> .drop('lo')
>>> )
>>> #
>>> #win
>>> return final
>>>
>>> On Wed, Nov 16, 2016 at 10:50 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon <gurwls...@gmail.com>
>>>> wrote:
>>>>
>>>>> Maybe it sounds like you are looking for from_json/to_json functions
>>>>> after en/decoding properly.
>>>>>
>>>>
>>>> Which are new built-in functions that will be released with Spark 2.1.
>>>>
>>>
>>>
>>
>


Another Interesting Question on SPARK SQL

2016-11-17 Thread kant kodali
[Inline diagram of the Spark SQL query planning phases omitted from the archive.]
Which parts in the diagram above are executed by DataSource connectors, and
which parts are executed by Tungsten? Or, to put it another way, in which phase
of the diagram above does Tungsten leverage the DataSource connectors (such as,
say, the Cassandra connector)?

My understanding so far is that connectors come in during the physical planning
phase, but I am not sure whether the connectors take the logical plan as input.

Thanks,
kant


How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread kant kodali
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets

"Spark SQL can automatically infer the schema of a JSON dataset and load it
as a DataFrame. This conversion can be done using SQLContext.read.json() on
either an RDD of String, or a JSON file."

val df = spark.sql("SELECT json_encoded_blob_column from table_name"); // a Cassandra query (Cassandra stores blobs in hexadecimal)

json_encoded_blob_column is encoded in hexadecimal. It would be great to have
these blobs interpreted and loaded as a data frame, but for now is there any
way to load or parse json_encoded_blob_column into a data frame?
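
If such functions become available (the new from_json function, for example),
I imagine something roughly like the following, with the schema known up front
(this is a sketch in Java; the schema and field names are illustrative, and df
is the frame selected above):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
        .add("user", DataTypes.StringType)
        .add("amount", DataTypes.DoubleType);

// unhex: hex string -> binary, decode: binary -> UTF-8 string, from_json: string -> struct
Dataset<Row> parsed = df.select(
        from_json(decode(unhex(col("json_encoded_blob_column")), "UTF-8"), schema)
            .as("payload"))
        .select("payload.*");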


Re: Spark Improvement Proposals

2016-10-12 Thread kant kodali
Some of you may have already seen this, but in case you haven't, you may want
to check it out.

http://www.slideshare.net/sbaltagi/flink-vs-spark



On Tue, Oct 11, 2016 at 1:57 PM, Ryan Blue 
wrote:

> I don't think we will have trouble with whatever rule that is adopted for
> accepting proposals. Considering committers' votes binding (if that is what
> we choose) is an established practice as long as it isn't for specific
> votes, like a release vote. From the Apache docs: "Who is permitted to vote
> is, to some extent, a community-specific thing." [1] And, I also don't see
> why it would be a problem to choose consensus, as long as we have an open
> discussion and vote about these rules.
>
> rb
>
> On Mon, Oct 10, 2016 at 4:15 PM, Cody Koeninger 
> wrote:
>
>> If someone wants to tell me that it's OK and "The Apache Way" for
>> Kafka and Flink to have a proposal process that ends in a lazy
>> majority, but it's not OK for Spark to have a proposal process that
>> ends in a non-lazy consensus...
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+
>> Improvement+Proposals#KafkaImprovementProposals-Process
>>
>> In practice any PMC member can stop a proposal they don't like, so I'm
>> not sure how much it matters.
>>
>>
>>
>> On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra 
>> wrote:
>> > There is a larger issue to keep in mind, and that is that what you are
>> > proposing is a procedure that, as far as I am aware, hasn't previously
>> been
>> > adopted in an Apache project, and thus is not an easy or exact fit with
>> > established practices that have been blessed as "The Apache Way".  As
>> such,
>> > we need to be careful, because we have run into some trouble in the past
>> > with some inside the ASF but essentially outside the Spark community who
>> > didn't like the way we were doing things.
>> >
>> > On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger 
>> wrote:
>> >>
>> >> Apache documents say lots of confusing stuff, including that committers are
>> >> in practice given a vote.
>> >>
>> >> https://www.apache.org/foundation/voting.html
>> >>
>> >> I don't care either way; if someone wants me to sub committer for PMC in
>> >> the voting section, fine, we just need a clear outcome.
>> >>
>> >>
>> >> On Oct 10, 2016 17:36, "Mark Hamstra"  wrote:
>> >>>
>> >>> If I'm correctly understanding the kind of voting that you are talking
>> >>> about, then to be accurate, it is only the PMC members that have a
>> vote, not
>> >>> all committers:
>> >>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>> >>>
>> >>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger 
>> >>> wrote:
>> 
>>  I think the main value is in being honest about what's going on.  No
>>  one other than committers can cast a meaningful vote, that's the
>>  reality.  Beyond that, if people think it's more open to allow formal
>>  proposals from anyone, I'm not necessarily against it, but my main
>>  question would be this:
>> 
>>  If anyone can submit a proposal, are committers actually going to
>>  clearly reject and close proposals that don't meet the requirements?
>> 
>>  Right now we have a serious problem with lack of clarity regarding
>>  contributions, and that cannot spill over into goal-setting.
>> 
>>  On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue 
>> wrote:
>>  > +1 to votes to approve proposals. I agree that proposals should
>> have
>>  > an
>>  > official mechanism to be accepted, and a vote is an established
>> means
>>  > of
>>  > doing that well. I like that it includes a period to review the
>>  > proposal and
>>  > I think proposals should have been discussed enough ahead of a
>> vote to
>>  > survive the possibility of a veto.
>>  >
>>  > I also like the names that are short and (mostly) unique, like SEP.
>>  >
>>  > Where I disagree is with the requirement that a committer must
>>  > formally
>>  > propose an enhancement. I don't see the value of restricting this:
>> if
>>  > someone has the will to write up a proposal then they should be
>>  > encouraged
>>  > to do so and start a discussion about it. Even if there is a
>> political
>>  > reality as Cody says, what is the value of codifying that in our
>>  > process? I
>>  > think restricting who can submit proposals would only undermine
>> them
>>  > by
>>  > pushing contributors out. Maybe I'm missing something here?
>>  >
>>  > rb
>>  >
>>  >
>>  >
>>  > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger <
>> c...@koeninger.org>
>>  > wrote:
>>  >>
>>  >> Yes, users suggesting SIPs is a good thing and is explicitly
>> called
>>  >> out in the linked document under the Who? section.  Formally
>>  >> proposing
>> 

Re: This Exception has been really hard to trace

2016-10-10 Thread kant kodali
Hi,
I use Gradle and I don't think it really has "provided", but I was able to Google around
and create the following file; the same error still persists.
group 'com.company'
version '1.0-SNAPSHOT'

apply plugin: 'java'
apply plugin: 'idea'

repositories {
    mavenCentral()
    mavenLocal()
}

configurations {
    provided
}

sourceSets {
    main {
        compileClasspath += configurations.provided
        test.compileClasspath += configurations.provided
        test.runtimeClasspath += configurations.provided
    }
}

idea {
    module {
        scopes.PROVIDED.plus += [ configurations.provided ]
    }
}

dependencies {
    compile 'org.slf4j:slf4j-log4j12:1.7.12'
    provided group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
    provided group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
    provided group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
    provided group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.0-M3'
}

jar {
    // note: this copies the provided dependencies (Spark itself included) into the fat jar as well
    from { configurations.provided.collect { it.isDirectory() ? it : zipTree(it) } }   // with jar
    from sourceSets.test.output
    manifest {
        attributes 'Main-Class': "com.company.batchprocessing.Hello"
    }
    exclude 'META-INF/*.RSA', 'META-INF/*.SF', 'META-INF/*.DSA'
    zip64 true
}

This successfully creates the jar, but the error still persists.
 





On Sun, Oct 9, 2016 11:44 PM, Shixiong(Ryan) Zhu shixi...@databricks.com
wrote:
Seems the runtime Spark is different from the one you compiled against. You should mark
the Spark components as "provided". See
https://issues.apache.org/jira/browse/SPARK-9219

Re: This Exception has been really hard to trace

2016-10-09 Thread kant kodali
Hi Reynold,
Actually, I did that well before posting my question here.
Thanks,
kant
 





On Sun, Oct 9, 2016 8:48 PM, Reynold Xin r...@databricks.com
wrote:
You should probably check with DataStax who build the Cassandra connector for
Spark.



This Exception has been really hard to trace

2016-10-09 Thread kant kodali
I tried spanBy, but it looks like there is a strange error happening no matter which way
I try, like the one described here for the Java solution:

http://qaoverflow.com/question/how-to-use-spanby-in-java/

java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD = javaFunctions(sc)
        .cassandraTable("test", "hello")
        .select("col1", "col2", "col3")
        .spanBy(new Function<CassandraRow, ByteBuffer>() {
            @Override
            public ByteBuffer call(CassandraRow v1) {
                return v1.getBytes("rowkey");
            }
        }, ByteBuffer.class);

And then when I do this, here is where the problem occurs:

List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
        cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}

So I tried this and got the same error:

Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
        cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
Although I understand that ByteBuffers aren't serializable, I didn't get any
not-serializable exception; still, I went ahead and changed everything to byte[], so
there are no more ByteBuffers in the code.
I have also tried cassandraRowsRDD.collect().forEach() and
cassandraRowsRDD.stream().forEachPartition(), and the exact same error occurs.
I am running everything locally and in standalone mode, so my Spark cluster is just
running on localhost.
Scala code runner version 2.11.8  // when I run scala -version or even
./spark-shell

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.0-M3'

So I don't see anything wrong with these versions.
2) I am bundling everything into one jar, and so far that has worked out well except for
this error.
I am using Java 8 and Gradle.

any ideas on how I can fix this?

Fwd: seeing this message repeatedly.

2016-09-05 Thread kant kodali

-- Forwarded message --
From: kant kodali  <kanth...@gmail.com>
Date: Sat, Sep 3, 2016 at 5:39 PM
Subject: seeing this message repeatedly.
To: "user @spark" <u...@spark.apache.org>



Hi Guys,
I am running my driver program on my local machine and my Spark cluster is on AWS. The
big question is that I don't know what the right settings are to get around this
public/private IP thing on AWS. My spark-env.sh currently has the following lines:

export SPARK_PUBLIC_DNS="52.44.36.224"
export SPARK_WORKER_CORES=12
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"

I am seeing the lines below when I run my driver program on my local machine; not sure
what is going on?


16/09/03 17:32:15 INFO DAGScheduler: Submitting 50 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[1] at start at Consumer.java:41)
16/09/03 17:32:15 INFO TaskSchedulerImpl: Adding task set 0.0 with 50 tasks
16/09/03 17:32:30 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/09/03 17:32:45 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
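
For what it's worth, the settings that usually matter in this driver-outside-the-cluster setup are sketched below; the driver address, ports, and resource caps are placeholders you would have to adapt (and open in the security group), so treat this only as a rough sketch:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverOutsideClusterSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("consumer")
                // The master's address as reachable from the laptop.
                .setMaster("spark://52.44.36.224:7077")
                // The address the AWS workers should use to reach back to the driver;
                // it must be routable from the cluster (placeholder below).
                .set("spark.driver.host", "203.0.113.10")
                // Pin the driver-side ports so they can be opened in the firewall.
                .set("spark.driver.port", "51000")
                .set("spark.blockManager.port", "51001")
                // Ask for no more than the cluster can actually grant; otherwise the
                // "has not accepted any resources" warning keeps repeating.
                .set("spark.cores.max", "4")
                .set("spark.executor.memory", "2g");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Default parallelism: " + sc.defaultParallelism());
        sc.stop();
    }
}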

Re: What are the names of the network protocols used between Spark Driver, Master and Workers?

2016-08-30 Thread kant kodali

OK, I will answer my own question. It looks like it is Netty-based RPC.





On Mon, Aug 29, 2016 9:22 PM, kant kodali kanth...@gmail.com wrote:
What are the names of the network protocols used between Spark Driver, Master
and Workers?

Re: is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-24 Thread kant kodali

can you please elaborate a bit more?





On Wed, Aug 24, 2016 12:41 AM, Sean Owen so...@cloudera.com wrote:
Byte code, no. It's sufficient to store the information that the RDD represents,
which can include serialized function closures, but that's not quite storing
byte code.

is the Lineage of RDD stored as a byte code in memory or a file?

2016-08-23 Thread kant kodali

Hi Guys,
I have had this question for a very long time, and after diving into the source code
(specifically the links below) I have a feeling that the lineage of an RDD (the
transformations) is converted into byte code and stored in memory or on disk. Or, to ask
a similar question: do we ever store JVM byte code or Python byte code in memory or on
disk? This makes sense to me because if we were to reconstruct an RDD after a node
failure we would need to go through the lineage and execute the respective
transformations, so storing their byte code does make sense. However, many people seem
to disagree with me, so it would be great if someone could clarify.

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471

https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
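
To see what the lineage actually holds, toDebugString() is handy: it prints the chain of RDD descriptions (parents plus the recorded operations), and the functions passed to the transformations travel as ordinary serialized closures rather than as stored byte code. A minimal sketch:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineageSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each transformation just returns a new RDD that records its parent and the
        // function (a serializable closure) to apply; nothing runs until an action.
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
        JavaRDD<Integer> bigOnes = doubled.filter(x -> x > 4);

        // The lineage is this chain of RDD descriptions; after a failure Spark
        // re-applies the recorded functions to the parent partitions.
        System.out.println(bigOnes.toDebugString());
        System.out.println(bigOnes.collect());

        sc.stop();
    }
}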