Re: [vote] Apache Spark 3.0 RC3

2020-06-09 Thread Shixiong(Ryan) Zhu
+1 (binding)

Best Regards,
Ryan


On Tue, Jun 9, 2020 at 4:24 AM Wenchen Fan  wrote:

> +1 (binding)
>
> On Tue, Jun 9, 2020 at 6:15 PM Dr. Kent Yao  wrote:
>
>> +1 (non-binding)
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Shixiong(Ryan) Zhu
Hey Jungtaek,

I totally agree with you about the issues of the complete mode you raised
here. However, not all streaming queries have unbounded state that grows
quickly to an unmanageable size.

Actually, I have found the complete mode to be pretty useful when the state is
bounded and small. For example, a user can build a realtime dashboard based
on daily aggregation results (only 365 or 366 keys per year, so fewer than
40k keys in 100 years) using the memory sink in the following steps (a rough
sketch follows the list):

- Write a streaming query to poll data from Kafka, calculate the
aggregation results, and save them to the memory sink in complete mode.
- In the same Spark application, start a thrift server with
"spark.sql.hive.thriftServer.singleSession=true" to expose the temp table
created by the memory sink through JDBC/ODBC.
- Connect a BI tool using JDBC/ODBC to query the temp table created by the
memory sink.
- Use the BI tool to build a realtime dashboard by polling the results at a
specified interval.
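
Here is a rough, hypothetical sketch of those steps in Scala (the broker
address, topic name and table name are made up, and it assumes the
hive-thriftserver module is on the classpath); it is an illustration, not
code from an actual deployment:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val spark = SparkSession.builder()
  .appName("daily-dashboard")
  .config("spark.sql.hive.thriftServer.singleSession", "true")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// 1. Poll Kafka and aggregate per day (bounded, small state).
val daily = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(to_date($"timestamp").as("day"))
  .groupBy($"day")
  .count()

// 2. Complete mode + memory sink registers a temp table named "daily_counts".
daily.writeStream
  .format("memory")
  .queryName("daily_counts")
  .outputMode("complete")
  .start()

// 3. Expose the temp table over JDBC/ODBC so a BI tool can poll it.
HiveThriftServer2.startWithContext(spark.sqlContext)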

Best Regards,
Ryan


On Mon, May 18, 2020 at 8:44 PM Jungtaek Lim 
wrote:

> Hi devs,
>
> while dealing with SPARK-31706 [1] we figured out the streaming output
> mode is only effective for stateful aggregation and not guaranteed on sink,
> which could expose data loss issue. SPARK-31724 [2] is filed to track the
> efforts on improving the streaming output mode.
>
> Before we revisit the streaming output mode, I'd like to initiate the
> discussion around "complete" streaming output mode first, because I have no
> idea how it works for production use case. For me, it's only useful for
> niche cases and no other streaming framework has such concept.
>
> 1. It destroys the purpose of watermark and forces Spark to maintain all
> of state rows, growing incrementally. It only works when all keys are
> bounded to the limited set.
>
> 2. It has to provide all state rows as outputs per batch, hence the size
> of outputs is also growing incrementally.
>
> 3. It has to truncate the target before putting rows which might not be
> trivial for external storage if it should be executed per batch.
>
> 4. It enables some operations like sort on streaming query or couple of
> more things. But it will not work cleanly (state won't keep up) under
> reasonably high input rate, and we have to consider how the operation will
> work for streaming output mode hence non-trivial amount of consideration
> has to be added to maintain the mode.
>
> It would be a headache to retain the complete mode if we consider
> improving modes, as someone might concern about compatibility. It would be
> nice if we can make a consensus on the viewpoint of complete mode and drop
> supporting it if we agree with.
>
> Would like to hear everyone's opinions. It would be great if someone
> brings the valid cases where complete mode is being used in production.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-31706
> 2. https://issues.apache.org/jira/browse/SPARK-31724
>
>
>


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Shixiong(Ryan) Zhu
"spark.sql("set -v")" returns a Dataset that has all non-internal SQL
configurations. Should be pretty easy to automatically generate a SQL
configuration page.
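
A minimal sketch of such a generator (the column positions of the "SET -v"
output and the output file name are assumptions here, not part of any
existing build step):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gen-sql-config-doc").getOrCreate()

// "SET -v" returns one row per non-internal config: key, value, meaning.
val rows = spark.sql("SET -v").collect()

val table = new StringBuilder
table ++= "| Property Name | Default | Meaning |\n"
table ++= "| --- | --- | --- |\n"
rows.sortBy(_.getString(0)).foreach { r =>
  table ++= s"| ${r.getString(0)} | ${r.getString(1)} | ${r.getString(2)} |\n"
}

// Write a markdown page that the docs build could include.
Files.write(Paths.get("sql-configs.md"), table.toString.getBytes("UTF-8"))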

Best Regards,
Ryan


On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:

> I think automatically creating a configuration page isn't a bad idea
> because I think we deprecate and remove configurations which are not
> created via .internal() in SQLConf anyway.
>
> I already tried this automatic generation from the codes at SQL built-in
> functions and I'm pretty sure we can do the similar thing for
> configurations as well.
>
> We could perhaps mimic what hadoop does
> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>
> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>
>> Some of it is intentionally undocumented, as far as I know, as an
>> experimental option that may change, or legacy, or safety valve flag.
>> Certainly anything that's marked an internal conf. (That does raise
>> the question of who it's for, if you have to read source to find it.)
>>
>> I don't know if we need to overhaul the conf system, but there may
>> indeed be some confs that could legitimately be documented. I don't
>> know which.
>>
>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>  wrote:
>> >
>> > I filed SPARK-30510 thinking that we had forgotten to document an
>> option, but it turns out that there's a whole bunch of stuff under
>> SQLConf.scala that has no public documentation under
>> http://spark.apache.org/docs.
>> >
>> > Would it be appropriate to somehow automatically generate a
>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>> >
>> > Another thought that comes to mind is moving the config definitions out
>> of Scala and into a data format like YAML or JSON, and then sourcing that
>> both for SQLConf as well as for whatever documentation page we want to
>> generate. What do you think of that idea?
>> >
>> > Nick
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Adding JIRA ID as the prefix for the test case name

2019-11-14 Thread Shixiong(Ryan) Zhu
Should we also add a guideline for non-Scala tests? Other languages (Java,
Python, R) don't support using a string as a test name.
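
For reference, a small made-up example of how the convention reads in a Scala
test, where the test name is a free-form string (assuming an older ScalaTest
where org.scalatest.FunSuite is available); in Java/Python/R the test name is
usually a method name, so the ID would have to live in the method name or a
comment instead:

import org.scalatest.FunSuite

class ExampleSuite extends FunSuite {
  // The JIRA ID prefix points readers at the original issue.
  test("SPARK-12345: repartition preserves the row count") {
    assert(Seq(1, 2, 3).size === 3)
  }
}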

Best Regards,
Ryan


On Thu, Nov 14, 2019 at 4:04 AM Hyukjin Kwon  wrote:

> I opened a PR - https://github.com/apache/spark-website/pull/231
>
> 2019년 11월 13일 (수) 오전 10:43, Hyukjin Kwon 님이 작성:
>
>> > In general a test should be self descriptive and I don't think we
>> should be adding JIRA ticket references wholesale. Any action that the
>> reader has to take to understand why a test was introduced is one too many.
>> However in some cases the thing we are trying to test is very subtle and in
>> that case a reference to a JIRA ticket might be useful, I do still feel
>> that this should be a backstop and that properly documenting your tests is
>> a much better way of dealing with this.
>>
>> Yeah, the test should be self-descriptive. I don't think adding a JIRA
>> prefix harms this point. Probably I should add this sentence in the
>> guidelines as well.
>> Adding a JIRA prefix just adds one extra hint to track down details. I
>> think it's fine to stick to this practice and make it simpler and clear to
>> follow.
>>
>> > 1. what if multiple JIRA IDs relating to the same test? we just take
>> the very first JIRA ID?
>> Ideally one JIRA should describe one issue and one PR should fix one JIRA
>> with a dedicated test.
>> Yeah, I think I would take the very first JIRA ID.
>>
>> > 2. are we going to have a full scan of all existing tests and attach a
>> JIRA ID to it?
>> Yea, let's don't do this.
>>
>> > It's a nice-to-have, not super essential, just because ...
>> It's been asked multiple times and each committer seems to have a
>> different understanding of this.
>> It's not a biggie but wanted to make it clear and conclude this.
>>
>> > I'd add this only when a test specifically targets a certain issue.
>> Yes, so this one I am not sure. From what I heard, people adds the JIRA
>> in cases below:
>>
>> - Whenever the JIRA type is a bug
>> - When a PR adds a couple of tests
>> - Only when a test specifically targets a certain issue.
>> - ...
>>
>> Which one do we prefer and simpler to follow?
>>
>> Or I can combine as below (im gonna reword when I actually document this):
>> 1. In general, we should add a JIRA ID as prefix of a test when a PR
>> targets to fix a specific issue.
>> In practice, it usually happens when a JIRA type is a bug or a PR
>> adds a couple of tests.
>> 2. Uses "SPARK-: test name" format
>>
>> If we have no objection with ^, let me go with this.
>>
>> 2019년 11월 13일 (수) 오전 8:14, Sean Owen 님이 작성:
>>
>>> Let's suggest "SPARK-12345:" but not go back and change a bunch of test
>>> cases.
>>> I'd add this only when a test specifically targets a certain issue.
>>> It's a nice-to-have, not super essential, just because in the rare
>>> case you need to understand why a test asserts something, you can go
>>> back and find what added it in the git history without much trouble.
>>>
>>> On Mon, Nov 11, 2019 at 10:46 AM Hyukjin Kwon 
>>> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Maybe it's not a big deal, but it has brought some confusion from time
>>> > to time into Spark dev and the community. I think it's time to discuss when
>>> > and in which format to add a JIRA ID as a prefix for the test case name in
>>> > Scala test cases.
>>> >
>>> > Currently we have many test case names with prefixes as below:
>>> >
>>> > test("SPARK-X blah blah")
>>> > test("SPARK-X: blah blah")
>>> > test("SPARK-X - blah blah")
>>> > test("[SPARK-X] blah blah")
>>> > …
>>> >
>>> > It is a good practice to have the JIRA ID in general because, for
>>> instance,
>>> > it makes us put less efforts to track commit histories (or even when
>>> the files
>>> > are totally moved), or to track related information of tests failed.
>>> > Considering Spark's getting big, I think it's good to document.
>>> >
>>> > I would like to suggest this and document it in our guideline:
>>> >
>>> > 1. Add a prefix into a test name when a PR adds a couple of tests.
>>> > 2. Uses "SPARK-: test name" format which is used in our code base
>>> most
>>> >   often[1].
>>> >
>>> > We should make it simple and clear but closer to the actual practice.
>>> So, I would like to listen to what other people think. I would appreciate
>>> if you guys give some feedback about when to add the JIRA prefix. One
>>> alternative is that, we only add the prefix when the JIRA's type is bug.
>>> >
>>> > [1]
>>> > git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>>> >  923
>>> > git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>>> >  477
>>> > git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>>> >   16
>>> > git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>>> >   13
>>> >
>>> >
>>> >
>>>
>>


Re: Why two netty libs?

2019-09-03 Thread Shixiong(Ryan) Zhu
Yep, historical reasons. And Netty 4 is under another namespace, so we can
use Netty 3 and Netty 4 in the same JVM.
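
A tiny illustration of that point (assuming both the Netty 3.x and 4.x jars
are on the classpath): the two versions live under different packages, so
their classes never collide.

// Netty 3 lives under org.jboss.netty, Netty 4 under io.netty.
import org.jboss.netty.channel.{Channel => Netty3Channel}
import io.netty.channel.{Channel => Netty4Channel}

object NettyNamespaces {
  def describe(c3: Netty3Channel, c4: Netty4Channel): String =
    s"netty3=${c3.getClass.getName}, netty4=${c4.getClass.getName}"
}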

On Tue, Sep 3, 2019 at 6:15 AM Sean Owen  wrote:

> It was for historical reasons; some other transitive dependencies needed
> it.
> I actually was just able to exclude Netty 3 last week from master.
> Spark uses Netty 4.
>
> On Tue, Sep 3, 2019 at 6:59 AM Jacek Laskowski  wrote:
> >
> > Hi,
> >
> > Just noticed that Spark 2.4.x uses two netty deps of different versions.
> Why?
> >
> > jars/netty-all-4.1.17.Final.jar
> > jars/netty-3.9.9.Final.jar
> >
> > Shouldn't one be excluded or perhaps shaded?
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://about.me/JacekLaskowski
> > The Internals of Spark SQL https://bit.ly/spark-sql-internals
> > The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> > The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> > Follow me at https://twitter.com/jaceklaskowski
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --

Best Regards,
Ryan


Re: [SS] KafkaSource doesn't use KafkaSourceInitialOffsetWriter for initial offsets?

2019-08-26 Thread Shixiong(Ryan) Zhu
We were worried about regressions when adding Kafka source v2 because it had
lots of changes. Hence we copy-pasted the code to keep the Kafka source v1
untouched and provided a config to fall back to v1.

On Mon, Aug 26, 2019 at 7:05 AM Jungtaek Lim  wrote:

> Thanks! The patch is here: https://github.com/apache/spark/pull/25583
>
> On Mon, Aug 26, 2019 at 11:02 PM Gabor Somogyi 
> wrote:
>
>> Just checked this and it's a copy-paste :) It works properly when
>> KafkaSourceInitialOffsetWriter is used. Pull me in if a review is needed.
>>
>> BR,
>> G
>>
>>
>> On Mon, Aug 26, 2019 at 3:57 PM Jungtaek Lim  wrote:
>>
>>> Nice finding! I don't see any reason to not use
>>> KafkaSourceInitialOffsetWriter from KafkaSource, as they're identical. I
>>> guess it was copied and pasted sometime before and not addressed yet.
>>> As you haven't submitted a patch, I'll submit a patch shortly, with
>>> mentioning credit. I'd close mine and wait for your patch if you plan to do
>>> it. Please let me know.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Mon, Aug 26, 2019 at 8:03 PM Jacek Laskowski  wrote:
>>>
 Hi,

 Just found out that KafkaSource [1] does not
 use KafkaSourceInitialOffsetWriter (of KafkaMicroBatchStream) [2] for
 initial offsets.

 Any reason for that? Should I report an issue? Just checking out as I'm
 with 2.4.3 exclusively and have no idea what's coming for 3.0.

 [1]
 https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala#L102

 [2]
 https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L281

 Pozdrawiam,
 Jacek Laskowski
 
 https://about.me/JacekLaskowski
 The Internals of Spark SQL https://bit.ly/spark-sql-internals
 The Internals of Spark Structured Streaming
 https://bit.ly/spark-structured-streaming
 The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
 Follow me at https://twitter.com/jaceklaskowski


>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>
-- 

Best Regards,
Ryan


Re: [VOTE] Release Apache Spark 2.4.2

2019-04-22 Thread Shixiong(Ryan) Zhu
+1 I have tested it and looks good!

Best Regards,
Ryan


On Sun, Apr 21, 2019 at 8:49 PM Wenchen Fan  wrote:

> Yea these should be mentioned in the 2.4.1 release notes.
>
> It seems we only have one ticket that is labeled as "release-notes" for
> 2.4.2: https://issues.apache.org/jira/browse/SPARK-27419 . I'll mention
> it when I write release notes.
>
> On Mon, Apr 22, 2019 at 5:46 AM Sean Owen  wrote:
>
>> One minor comment: for 2.4.1 we had a couple JIRAs marked 'release-notes':
>>
>> https://issues.apache.org/jira/browse/SPARK-27198?jql=project%20%3D%20SPARK%20and%20fixVersion%20%20in%20(2.4.1%2C%202.4.2)%20and%20labels%20%3D%20%27release-notes%27
>>
>> They should be mentioned in
>> https://spark.apache.org/releases/spark-release-2-4-1.html possibly
>> like "Changes of behavior" in
>> https://spark.apache.org/releases/spark-release-2-4-0.html
>>
>> I can retroactively update that page; is this part of the notes for
>> the release process though? I missed this one for sure as it's easy to
>> overlook with all the pages being updated per release.
>>
>> On Thu, Apr 18, 2019 at 9:51 PM Wenchen Fan  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 2.4.2.
>> >
>> > The vote is open until April 23 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 2.4.2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v2.4.2-rc1 (commit
>> a44880ba74caab7a987128cb09c4bee41617770a):
>> > https://github.com/apache/spark/tree/v2.4.2-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1322/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.2-rc1-docs/
>> >
>> > The list of bug fixes going into 2.4.1 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12344996
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with a out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 2.4.2?
>> > ===
>> >
>> > The current list of open tickets targeted at 2.4.2 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.2
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>


Re: Scala type checking thread-safety issue, and global locks to resolve it

2019-03-15 Thread Shixiong(Ryan) Zhu
Forgot to link the ticket that removed the global ScalaReflectionLock:
https://issues.apache.org/jira/browse/SPARK-19810

Best Regards,
Ryan


On Fri, Mar 15, 2019 at 10:40 AM Shixiong(Ryan) Zhu 
wrote:

> Hey Sean,
>
> Sounds good to me. At least, it's not worse than any version prior to
> 2.3.0, which had a global ScalaReflectionLock. In addition, if someone hits
> a performance regression caused by this, they are probably creating too
> many Encoders. Reusing Encoders is a better solution for this case.
>
> Best regards,
> Shixiong
>
> On Thu, Mar 14, 2019 at 2:25 PM Sean Owen  wrote:
>
>> This is worth a look: https://github.com/apache/spark/pull/24085
>>
>> Scala has a surprising thread-safety bug in the "<:<" operator that's
>> used to check subtypes, which can lead to incorrect results in
>> non-trivial situations.
>>
>> The fix on the table is to introduce a global lock to protect a lot of
>> the Scala-related reflection code to resolve it. This may be the best
>> we can do, but 'global lock' sounds ominous.
>>
>> Any thoughts on other ways to resolve it? I'm not sure there are.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Scala type checking thread-safety issue, and global locks to resolve it

2019-03-15 Thread Shixiong(Ryan) Zhu
Hey Sean,

Sounds good to me. At least, it's not worse than any version prior to
2.3.0, which had a global ScalaReflectionLock. In addition, if someone hits
a performance regression caused by this, they are probably creating too
many Encoders. Reusing Encoders is a better solution for this case.
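
A small sketch of the "reuse Encoders" suggestion (the case class and helper
are made up for illustration): derive the encoder once and pass it explicitly
instead of letting every call re-derive it through Scala reflection.

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

case class Event(id: Long, name: String)

object EncoderReuse {
  // Derived once; reusing it avoids repeated reflection (and the lock) per call.
  val eventEncoder: Encoder[Event] = Encoders.product[Event]

  def toDataset(spark: SparkSession, events: Seq[Event]): Dataset[Event] =
    spark.createDataset(events)(eventEncoder)
}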

Best regards,
Shixiong

On Thu, Mar 14, 2019 at 2:25 PM Sean Owen  wrote:

> This is worth a look: https://github.com/apache/spark/pull/24085
>
> Scala has a surprising thread-safety bug in the "<:<" operator that's
> used to check subtypes, which can lead to incorrect results in
> non-trivial situations.
>
> The fix on the table is to introduce a global lock to protect a lot of
> the Scala-related reflection code to resolve it. This may be the best
> we can do, but 'global lock' sounds ominous.
>
> Any thoughts on other ways to resolve it? I'm not sure there are.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-04 Thread Shixiong(Ryan) Zhu
-1. Found an issue in a new 2.4 Java API:
https://issues.apache.org/jira/browse/SPARK-25644 We should fix it in 2.4.0
to avoid future breaking changes.

Best Regards,
Ryan


On Mon, Oct 1, 2018 at 7:22 PM Michael Heuer  wrote:

> FYI I’ve open two new issues against 2.4.0 rc2
>
> https://issues.apache.org/jira/browse/SPARK-25587
> https://issues.apache.org/jira/browse/SPARK-25588
>
> that are regressions against 2.3.1, and may also be present in 2.3.2.
> They could use triage or review.
>
>michael
>
>
> On Oct 1, 2018, at 9:18 PM, Wenchen Fan  wrote:
>
> This RC fails because of the correctness bug: SPARK-25538
>
> I'll start a new RC once the fix(
> https://github.com/apache/spark/pull/22602) is merged.
>
> Thanks,
> Wenchen
>
> On Tue, Oct 2, 2018 at 1:21 AM Sean Owen  wrote:
>
>> Given that this release is probably still 2 weeks from landing, I don't
>> think that waiting on a spark-tensorflow-connector release with TF 1.12 in
>> mid-October is a big deal. Users can use the library with Spark 2.3.x for a
>> week or two before upgrading, if that's the case. I think this kind of bug
>> fix is appropriate for a minor release, while I could see trying to work
>> around to keep the buggy behavior in a maintenance release.
>> On Mon, Oct 1, 2018 at 12:11 PM Xiangrui Meng 
>> wrote:
>>
>>>
>>> IMHO, the use case (spark-tensorflow-connector) is very important. But
>>> whether we need to fix it in 2.4 branch depends on the release timeline.
>>> See my comment in the JIRA:
>>> https://issues.apache.org/jira/browse/SPARK-25378
>>>
>>>
>


Re: Support SqlStreaming in spark

2018-06-27 Thread Shixiong(Ryan) Zhu
Structured Streaming supports the same standard SQL as batch queries, so
users can switch their queries between batch and streaming easily (a small
sketch is below). Could you clarify what problems SqlStreaming solves and
what the benefits of the new syntax are?
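
Here is a rough sketch of what that switching looks like today (paths, topic
name and the value parsing are simplified placeholders): the same SQL text
runs over a batch table or a streaming temp view; only the source and the
sink change.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-on-streams").getOrCreate()

// Batch: read a static table and aggregate with SQL.
spark.read.json("/data/events").createOrReplaceTempView("events")
spark.sql("SELECT action, count(*) AS cnt FROM events GROUP BY action").show()

// Streaming: the identical SQL over a streaming view.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS action")  // simplified parsing
  .createOrReplaceTempView("streaming_events")

spark.sql("SELECT action, count(*) AS cnt FROM streaming_events GROUP BY action")
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()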

Best Regards,
Ryan

On Thu, Jun 14, 2018 at 7:06 PM, JackyLee  wrote:

> Hello
>
> Nowadays, more and more streaming products have begun to support SQL
> streaming, such as KafkaSQL, Flink SQL and Storm SQL. Supporting SQL
> Streaming can not only lower the barrier to entry for streaming, but also
> make streaming easier for everyone to adopt.
>
> At present, StructStreaming is relatively mature, and StructStreaming is
> based on the Dataset API, which makes it possible to provide a SQL portal for
> StructStreaming and run StructStreaming in SQL.
>
> To support SQL Streaming, there are two key points:
> 1. The analysis phase should be able to parse streaming-type SQL.
> 2. The Analyzer should be able to map metadata information to the corresponding
> Relation.
>
> Running StructStreaming in SQL can bring some benefits:
> 1. It reduces the entry threshold of StructStreaming and attracts users more
> easily.
> 2. It encapsulates the metadata of a source or sink into a table, so it can be
> maintained and managed uniformly and is more accessible to users.
> 3. Metadata permissions management, which is based on Hive, can control
> StructStreaming's overall authorization scheme more tightly.
>
> We have found some ways to solve this problem. It would be a pleasure to
> discuss it with you.
>
> Thanks,
>
> Jackey Lee
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Shixiong(Ryan) Zhu
FYI. I found two more blockers:

https://issues.apache.org/jira/browse/SPARK-23475
https://issues.apache.org/jira/browse/SPARK-23481

On Wed, Feb 21, 2018 at 9:45 AM, Xiao Li <gatorsm...@gmail.com> wrote:

> Hi, Ryan,
>
> In this release, Data Source V2 is experimental. We are still collecting
> the feedbacks from the community and will improve the related APIs and
> implementation in the next 2.4 release.
>
> Thanks,
>
> Xiao
>
> 2018-02-21 9:43 GMT-08:00 Xiao Li <gatorsm...@gmail.com>:
>
>> Hi, Justin,
>>
>> Based on my understanding, SPARK-17147 is also not a regression. Thus,
>> Spark 2.3.0 is unable to contain it. We have to wait for the committers who
>> are familiar with Spark Streaming to make a decision whether we can fix the
>> issue in Spark 2.3.1.
>>
>> Since this is open source, feel free to add the patch in your local build.
>>
>> Thanks for using Spark!
>>
>> Xiao
>>
>>
>> 2018-02-21 9:36 GMT-08:00 Ryan Blue <rb...@netflix.com.invalid>:
>>
>>> No problem if we can't add them, this is experimental anyway so this
>>> release should be more about validating the API and the start of our
>>> implementation. I just don't think we can recommend that anyone actually
>>> use DataSourceV2 without these patches.
>>>
>>> On Wed, Feb 21, 2018 at 9:21 AM, Wenchen Fan <cloud0...@gmail.com>
>>> wrote:
>>>
>>>> SPARK-23323 adds a new API, I'm not sure we can still do it at this
>>>> stage of the release... Besides users can work around it by calling the
>>>> spark output coordinator themselves in their data source.
>>>>
>>>> SPARK-23203 is non-trivial and didn't fix any known bugs, so it's hard
>>>> to convince other people that it's safe to add it to the release during the
>>>> RC phase.
>>>>
>>>> SPARK-23418 depends on the above one.
>>>>
>>>> Generally they are good to have in Spark 2.3, if they were merged
>>>> before the RC. I think this is a lesson we should learn from, that we
>>>> should work on stuff we want in the release before the RC, instead of 
>>>> after.
>>>>
>>>> On Thu, Feb 22, 2018 at 1:01 AM, Ryan Blue <rb...@netflix.com.invalid>
>>>> wrote:
>>>>
>>>>> What does everyone think about getting some of the newer DataSourceV2
>>>>> improvements in? It should be low risk because it is a new code path, and
>>>>> v2 isn't very usable without things like support for using the output
>>>>> commit coordinator to deconflict writes.
>>>>>
>>>>> The ones I'd like to get in are:
>>>>> * Use the output commit coordinator: https://issues.ap
>>>>> ache.org/jira/browse/SPARK-23323
>>>>> * Use immutable trees and the same push-down logic as other read
>>>>> paths: https://issues.apache.org/jira/browse/SPARK-23203
>>>>> * Don't allow users to supply schemas when they aren't supported:
>>>>> https://issues.apache.org/jira/browse/SPARK-23418
>>>>>
>>>>> I think it would make the 2.3.0 release more usable for anyone
>>>>> interested in the v2 read and write paths.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Tue, Feb 20, 2018 at 7:07 PM, Weichen Xu <weichen...@databricks.com
>>>>> > wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Wed, Feb 21, 2018 at 10:07 AM, Marcelo Vanzin <van...@cloudera.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Done, thanks!
>>>>>>>
>>>>>>> On Tue, Feb 20, 2018 at 6:05 PM, Sameer Agarwal <samee...@apache.org>
>>>>>>> wrote:
>>>>>>> > Sure, please feel free to backport.
>>>>>>> >
>>>>>>> > On 20 February 2018 at 18:02, Marcelo Vanzin <van...@cloudera.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Hey Sameer,
>>>>>>> >>
>>>>>>> >> Mind including https://github.com/apache/spark/pull/20643
>>>>>>> >> (SPARK-23468)  in the new RC? It's a minor bug since I've only
>>>>>>> hit it
>>>>>>> >> with older shuffle services, but it's pretty safe.
>>>>>>> >>
>>>>>>&

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-20 Thread Shixiong(Ryan) Zhu
I'm -1 because of the UI regression
https://issues.apache.org/jira/browse/SPARK-23470 : the All Jobs page may be
too slow and cause a "read timeout" when there are lots of jobs and stages.
This is one of the most important pages, because when it's broken it's pretty
hard to use the Spark Web UI.


On Tue, Feb 20, 2018 at 4:37 AM, Marco Gaido  wrote:

> +1
>
> 2018-02-20 12:30 GMT+01:00 Hyukjin Kwon :
>
>> +1 too
>>
>> 2018-02-20 14:41 GMT+09:00 Takuya UESHIN :
>>
>>> +1
>>>
>>>
>>> On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang 
>>> wrote:
>>>
 +1


 Wenchen Fan 于2018年2月20日 周二下午1:09写道:

> +1
>
> On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin 
> wrote:
>
>> +1
>>
>> On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal ,
>> wrote:
>>
>> this file shouldn't be included? https://dist.apache.org/repos/
>>> dist/dev/spark/v2.3.0-rc4-bin/spark-parent_2.11.iml
>>>
>>
>> I've now deleted this file
>>
>> *From:* Sameer Agarwal 
>>> *Sent:* Saturday, February 17, 2018 1:43:39 PM
>>> *To:* Sameer Agarwal
>>> *Cc:* dev
>>> *Subject:* Re: [VOTE] Spark 2.3.0 (RC4)
>>>
>>> I'll start with a +1 once again.
>>>
>>> All blockers reported against RC3 have been resolved and the builds
>>> are healthy.
>>>
>>> On 17 February 2018 at 13:41, Sameer Agarwal 
>>> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 2.3.0. The vote is open until Thursday February 22, 2018 at 
 8:00:00
 am UTC and passes if a majority of at least 3 PMC +1 votes are cast.


 [ ] +1 Release this package as Apache Spark 2.3.0

 [ ] -1 Do not release this package because ...


 To learn more about Apache Spark, please see
 https://spark.apache.org/

 The tag to be voted on is v2.3.0-rc4:
 https://github.com/apache/spark/tree/v2.3.0-rc4
 (44095cb65500739695b0324c177c19dfa1471472)

 List of JIRA tickets resolved in this release can be found here:
 https://issues.apache.org/jira/projects/SPARK/versions/12339551

 The release files, including signatures, digests, etc. can be found
 at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapache
 spark-1265/

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-docs
 /_site/index.html


 FAQ

 ===
 What are the unresolved issues targeted for 2.3.0?
 ===

 Please see https://s.apache.org/oXKi. At the time of writing,
 there are currently no known release blockers.

 =
 How can I help test this release?
 =

 If you are a Spark user, you can help us test this release by
 taking an existing Spark workload and running on this release 
 candidate,
 then reporting any regressions.

 If you're working in PySpark you can set up a virtual env and
 install the current RC and see if anything important breaks, in the
 Java/Scala you can add the staging repository to your projects 
 resolvers
 and test with the RC (make sure to clean up the artifact cache 
 before/after
 so you don't end up building with a out of date RC going forward).

 ===
 What should happen to JIRA tickets still targeting 2.3.0?
 ===

 Committers should look at those and triage. Extremely important bug
 fixes, documentation, and API tweaks that impact compatibility should 
 be
 worked on immediately. Everything else please retarget to 2.3.1 or 
 2.4.0 as
 appropriate.

 ===
 Why is my bug not fixed?
 ===

 In order to make timely releases, we will typically not hold the
 release unless the bug in question is a regression from 2.2.0. That 
 being
 said, if there is something which is a regression from 2.2.0 and has 
 not

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Shixiong(Ryan) Zhu
+ Jose

On Thu, Jan 25, 2018 at 2:18 PM, Dongjoon Hyun 
wrote:

> SPARK-23221 is one of the reasons for Kafka-test-suite deadlock issue.
>
> For the hang issues, it seems not to be marked as a failure correctly in
> Apache Spark Jenkins history.
>
>
> On Thu, Jan 25, 2018 at 1:03 PM, Marcelo Vanzin 
> wrote:
>
>> On Thu, Jan 25, 2018 at 12:29 PM, Sean Owen  wrote:
>> > I am still seeing these tests fail or hang:
>> >
>> > - subscribing topic by name from earliest offsets (failOnDataLoss:
>> false)
>> > - subscribing topic by name from earliest offsets (failOnDataLoss: true)
>>
>> This is something that we are seeing internally on a different version of
>> Spark, and we're currently investigating with our Kafka people. Not
>> sure it's the same issue (we have a newer version of Kafka libraries),
>> but this is just another way of saying that I don't think those hangs
>> are new in 2.3, at least.
>>
>> --
>> Marcelo
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: Build timed out for `branch-2.3 (hadoop-2.7)`

2018-01-12 Thread Shixiong(Ryan) Zhu
FYI, we reverted a commit in
https://github.com/apache/spark/commit/55dbfbca37ce4c05f83180777ba3d4fe2d96a02e
to fix the issue.

On Fri, Jan 12, 2018 at 11:45 AM, Xin Lu  wrote:

> seems like someone should investigate what caused the build time to go up
> an hour and if it's expected or not.
>
> On Thu, Jan 11, 2018 at 7:37 PM, Dongjoon Hyun 
> wrote:
>
>> Hi, All and Shane.
>>
>> Can we increase the build time for `branch-2.3` during 2.3 RC period?
>>
>> There are two known test issues, but the Jenkins on branch-2.3 with
>> hadoop-2.7 fails with build timeout. So, it's difficult to monitor whether
>> the branch is healthy or not.
>>
>> Build timed out (after 255 minutes). Marking the build as aborted.
>> Build was aborted
>> ...
>> Finished: ABORTED
>>
>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>> t%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/60/console
>> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Tes
>> t%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/47/console
>>
>> Bests,
>> Dongjoon.
>>
>
>


Re: [SQL] Why no numOutputRows metric for LocalTableScanExec in webUI?

2017-11-16 Thread Shixiong(Ryan) Zhu
SQL metrics are collected using a SparkListener. If a query runs no Spark
jobs (and therefore no tasks), org.apache.spark.sql.execution.ui.SQLListener
cannot collect any metrics.

On Thu, Nov 16, 2017 at 1:53 AM, Jacek Laskowski  wrote:

> Hi,
>
> I seem to have figured out why the metric is not in the web UI for the
> query, but wish I knew how to explain it for any metric and operator.
>
> It seems that numOutputRows metric won't be displayed in web UI when a
> query uses no Spark jobs.
>
> val names = Seq("Jacek", "Agata").toDF("name")
>
> // no numOutputRows metric in web UI
> names.show
>
> // The query gives numOutputRows metric in web UI's Details for Query (SQL
> tab)
> scala> names.groupBy(length($"name")).count.show
>
> That must be somewhat generic and I think has nothing to do with
> LocalTableScanExec. Could anyone explain it in more detail? I'd appreciate.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> On Wed, Nov 15, 2017 at 10:14 PM, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> I've been playing with LocalTableScanExec and noticed that it
>> defines numOutputRows metric, but I couldn't find it in the diagram in web
>> UI's Details for Query in SQL tab. Why?
>>
>> scala> spark.version
>> res1: String = 2.3.0-SNAPSHOT
>>
>> scala> val hello = udf { s: String => s"Hello $s" }
>> hello: org.apache.spark.sql.expressions.UserDefinedFunction =
>> UserDefinedFunction(,StringType,Some(List(StringType)))
>>
>> scala> Seq("Jacek").toDF("name").select(hello($"name")).show
>> +---+
>> |  UDF(name)|
>> +---+
>> |Hello Jacek|
>> +---+
>>
>> http://localhost:4040/SQL/execution/?id=0 shows no metrics for
>> LocalTableScan. Is this intended?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>
>


Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Shixiong(Ryan) Zhu
+1

On Tue, Nov 7, 2017 at 1:34 PM, Joseph Bradley 
wrote:

> +1
>
> On Mon, Nov 6, 2017 at 5:11 PM, Michael Armbrust 
> wrote:
>
>> +1
>>
>> On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li  wrote:
>>
>>> +1
>>>
>>> 2017-11-04 11:00 GMT-07:00 Burak Yavuz :
>>>
 +1

 On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan 
 wrote:

> +1
>
> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu 
> wrote:
>
>> +1.
>>
>> On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia <
>> matei.zaha...@gmail.com> wrote:
>>
>>> +1 from me too.
>>>
>>> Matei
>>>
>>> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan 
>>> wrote:
>>> >
>>> > +1.
>>> >
>>> > I think this architecture makes a lot of sense to let executors
>>> talk to source/sink directly, and bring very low latency.
>>> >
>>> > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen 
>>> wrote:
>>> > +0 simply because I don't feel I know enough to have an opinion. I
>>> have no reason to doubt the change though, from a skim through the doc.
>>> >
>>> >
>>> > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin 
>>> wrote:
>>> > Earlier I sent out a discussion thread for CP in Structured
>>> Streaming:
>>> >
>>> > https://issues.apache.org/jira/browse/SPARK-20928
>>> >
>>> > It is meant to be a very small, surgical change to Structured
>>> Streaming to enable ultra-low latency. This is great timing because we 
>>> are
>>> also designing and implementing data source API v2. If designed 
>>> properly,
>>> we can have the same data source API working for both streaming and 
>>> batch.
>>> >
>>> >
>>> > Following the SPIP process, I'm putting this SPIP up for a vote.
>>> >
>>> > +1: Let's go ahead and design / implement the SPIP.
>>> > +0: Don't really care.
>>> > -1: I do not think this is a good idea for the following reasons.
>>> >
>>> >
>>> >
>>>
>>>
>>> 
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783 <(224)%20436-0783>
> Greater Chicago
>


>>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


Re: [SS] Why does StreamingQueryManager.notifyQueryTermination use id and runId (not just id)?

2017-10-27 Thread Shixiong(Ryan) Zhu
stateStoreCoordinator uses runId to deal with the small chance that Spark
cannot shut a bad task down. Please see
https://github.com/apache/spark/pull/18355

On Fri, Oct 27, 2017 at 3:40 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm wondering why StreamingQueryManager.notifyQueryTermination [1] use a
> query id to remove it from the activeQueries internal registry [2] while
> notifies stateStoreCoordinator using runId [3]?
>
> My understanding is that id is the same across different runs of a query
> so once StreamingQueryManager removes the query (by its id) it effectively
> knows nothing about the query yet stateStoreCoordinator may have other
> instances running (since we only deactivated a single run).
>
> Why is the "inconsistency"?
>
> [1] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#
> L325
>
> [2] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#
> L327
>
> [3] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#
> L335
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Re: What is d3kbcqa49mib13.cloudfront.net ?

2017-09-15 Thread Shixiong(Ryan) Zhu
Can we just create those tables once locally using official Spark versions
and commit them? Then the unit tests could just read these files and wouldn't
need to download Spark.

On Thu, Sep 14, 2017 at 8:13 AM, Sean Owen  wrote:

> I think the download could use the Apache mirror, yeah. I don't know if
> there's a reason that it must though. What's good enough for releases is
> good enough for this purpose. People might not like the big download in the
> tests; if it really came up as an issue we could find ways to cache it
> better locally. I brought it up more as a question than a problem to solve.
>
> On Thu, Sep 14, 2017 at 5:02 PM Mark Hamstra 
> wrote:
>
>> The problem is that it's not really an "official" download link, but
>> rather just a supplemental convenience. While that may be ok when
>> distributing artifacts, it's more of a problem when actually building and
>> testing artifacts. In the latter case, the download should really only be
>> from an Apache mirror.
>>
>> On Thu, Sep 14, 2017 at 1:20 AM, Wenchen Fan  wrote:
>>
>>> That test case is trying to test the backward compatibility of
>>> `HiveExternalCatalog`. It downloads official Spark releases and creates
>>> tables with them, and then read these tables via the current Spark.
>>>
>>> About the download link, I just picked it from the Spark website, and
>>> this link is the default one when you choose "direct download". Do we have
>>> a better choice?
>>>
>>> On Thu, Sep 14, 2017 at 3:05 AM, Shivaram Venkataraman <
>>> shiva...@eecs.berkeley.edu> wrote:
>>>
 Mark, I agree with your point on the risks of using Cloudfront while
 building Spark. I was only trying to provide background on when we
 started using Cloudfront.

 Personally, I don't have enough about context about the test case in
 question (e.g. Why are we downloading Spark in a test case ?).

 Thanks
 Shivaram

 On Wed, Sep 13, 2017 at 11:50 AM, Mark Hamstra 
 wrote:
 > Yeah, but that discussion and use case is a bit different --
 providing a
 > different route to download the final released and approved artifacts
 that
 > were built using only acceptable artifacts and sources vs. building
 and
 > checking prior to release using something that is not from an Apache
 mirror.
 > This new use case puts us in the position of approving spark
 artifacts that
 > weren't built entirely from canonical resources located in presumably
 secure
 > and monitored repositories. Incorporating something that is not
 completely
 > trusted or approved into the process of building something that we
 are then
 > going to approve as trusted is different from the prior use of
 cloudfront.
 >
 > On Wed, Sep 13, 2017 at 10:26 AM, Shivaram Venkataraman
 >  wrote:
 >>
 >> The bucket comes from Cloudfront, a CDN thats part of AWS. There was
 a
 >> bunch of discussion about this back in 2013
 >>
 >> https://lists.apache.org/thread.html/9a72ff7ce913dd85a6b112b1b2de53
 6dcda74b28b050f70646aba0ac@1380147885@%3Cdev.spark.apache.org%3E
 >>
 >> Shivaram
 >>
 >> On Wed, Sep 13, 2017 at 9:30 AM, Sean Owen 
 wrote:
 >> > Not a big deal, but Mark noticed that this test now downloads Spark
 >> > artifacts from the same 'direct download' link available on the
 >> > downloads
 >> > page:
 >> >
 >> >
 >> > https://github.com/apache/spark/blob/master/sql/hive/
 src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSui
 te.scala#L53
 >> >
 >> > https://d3kbcqa49mib13.cloudfront.net/spark-$version-
 bin-hadoop2.7.tgz
 >> >
 >> > I don't know of any particular problem with this, which is a
 parallel
 >> > download option in addition to the Apache mirrors. It's also the
 >> > default.
 >> >
 >> > Does anyone know what this bucket is and if there's a strong
 reason we
 >> > can't
 >> > just use mirrors?
 >>
 >> 
 -
 >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
 >>
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>>
>>


Re: SQLListener concurrency bug?

2017-06-26 Thread Shixiong(Ryan) Zhu
Right now they are safe because the caller also calls synchronized when
using them. This was done to avoid copying objects, but it's probably a bad
design. If you want to refactor them, a PR is welcome; a rough sketch of one
option is below.
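
Not the actual SQLListener code, just a sketch of the pattern such a
refactoring could use: copy the mutable collection while holding the lock,
so callers never touch shared mutable state outside synchronized.

import scala.collection.mutable

class ExecutionsHolder[T] {
  private val failedExecutions = mutable.ListBuffer[T]()

  def add(e: T): Unit = synchronized { failedExecutions += e }

  // Return an immutable snapshot instead of the live mutable buffer.
  def getFailedExecutions: Seq[T] = synchronized { failedExecutions.toList }
}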

On Mon, Jun 26, 2017 at 2:27 AM, Oleksandr Vayda 
wrote:

> Hi all,
>
> Reading the source code of the org.apache.spark.sql.execution.ui.
> SQLListener, specifically this place - https://github.com/apache/
> spark/blob/master/sql/core/src/main/scala/org/apache/
> spark/sql/execution/ui/SQLListener.scala#L328
>
> def getFailedExecutions: Seq[SQLExecutionUIData] = synchronized {
> failedExecutions
> }
> def getCompletedExecutions: Seq[SQLExecutionUIData] = synchronized {
> completedExecutions
> }
> I believe the synchronized block is used here incorrectly. If I get it
> right the main purpose here is to synchronize access to the mutable
> collections from the UI (read) and the event bus (read/write) threads. But
> in the current implementation the "synchronized" blocks return bare
> references to mutable collections and in fact nothing gets synchronized.
> Is it a bug?
>
> Sincerely yours,
> Oleksandr Vayda
>
> mobile: +420 604 113 056 <+420%20604%20113%20056>
>


Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-26 Thread Shixiong(Ryan) Zhu
Hey Assaf,

You need to change "v2.2.0" to "v2.2.0-rc5" in the GitHub links because there
is no v2.2.0 right now.

On Mon, Jun 26, 2017 at 12:57 AM, assaf.mendelson 
wrote:

> Not a show stopper, however, I was looking at the structured streaming
> programming guide and under arbitrary stateful operations (
> https://people.apache.org/~pwendell/spark-releases/spark-
> 2.2.0-rc5-docs/structured-streaming-programming-guide.
> html#arbitrary-stateful-operations) the suggestion is to take a look at
> the examples (Scala
> 
> /Java
> 
> ). These link to a non-existent file (called StructuredSessionization or
> JavaStructuredSessionization, I couldn’t find either of these files in the
> repository).
>
> If the example file exists, I think it would be nice to add it, otherwise
> I would suggest simply removing the examples link from the programming
> guide (there are examples inside the group state API
> https://people.apache.org/~pwendell/spark-releases/spark-
> 2.2.0-rc5-docs/api/scala/index.html#org.apache.spark.
> sql.streaming.GroupState).
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
> *From:* Michael Armbrust [via Apache Spark Developers List] [mailto:ml+[hidden
> email] ]
> *Sent:* Wednesday, June 21, 2017 2:50 AM
> *To:* Mendelson, Assaf
> *Subject:* [VOTE] Apache Spark 2.2.0 (RC5)
>
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.2.0. The vote is open until Friday, June 23rd, 2017 at 18:00 PST and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
>
>
> [ ] +1 Release this package as Apache Spark 2.2.0
>
> [ ] -1 Do not release this package because ...
>
>
>
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
>
>
> The tag to be voted on is v2.2.0-rc5
>  (62e442e73a2fa66
> 3892d2edaff5f7d72d7f402ed)
>
>
>
> List of JIRA tickets resolved can be found with this filter
> 
> .
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-bin/
>
>
>
> Release artifacts are signed with the following key:
>
> https://people.apache.org/keys/committer/pwendell.asc
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1243/
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc5-docs/
>
>
>
>
>
> *FAQ*
>
>
>
> *How can I help test this release?*
>
>
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
>
> *What should happen to JIRA tickets still targeting 2.2.0?*
>
>
>
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>
>
>
> *But my bug isn't fixed!??!*
>
>
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from 2.1.1.
>
>
> --
>
> *If you reply to this email, your message will be added to the discussion
> below:*
>
> http://apache-spark-developers-list.1001551.n3.
> nabble.com/VOTE-Apache-Spark-2-2-0-RC5-tp21815.html
>
> To start a new topic under Apache Spark Developers List, email [hidden
> email] 
> To unsubscribe from Apache Spark Developers List, click here.
> NAML
> 
>
> --
> View this message in context: RE: [VOTE] Apache Spark 2.2.0 (RC5)
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>


Re: structured streaming documentation does not match behavior

2017-06-16 Thread Shixiong(Ryan) Zhu
I created https://issues.apache.org/jira/browse/SPARK-21123. PR is welcome.

On Thu, Jun 15, 2017 at 10:55 AM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> Good catch. These are file source options. Could you submit a PR to fix
> the doc? Thanks!
>
> On Thu, Jun 15, 2017 at 10:46 AM, Mendelson, Assaf <
> assaf.mendel...@rsa.com> wrote:
>
>> Hi,
>>
>> I have started to play around with structured streaming and it seems the
>> documentation (structured streaming programming guide) does not match the
>> actual behavior I am seeing.
>>
>> It says in the documentation that maxFilesPerTrigger (as well as
>> latestFirst) are options for the File sink. However, in fact, at least
>> maxFilesPerTrigger does not seem to have any real effect. On the other
>> hand, the streaming source (readStream) which has no documentation for this
>> option, does limit the number of files.
>>
>> This behavior actually makes more sense than the documentation as I
>> expect the file reader to define how to read files rather than the sink
>> (e.g. if I would use a kafka sink or foreach sink, they should still get
>> the same behavior from the reading).
>>
>>
>>
>> Thanks,
>>
>>   Assaf.
>>
>>
>>
>
>


Re: structured streaming documentation does not match behavior

2017-06-15 Thread Shixiong(Ryan) Zhu
Good catch. These are file source options. Could you submit a PR to fix the
doc? Thanks!
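
For reference, a quick sketch of where these options belong (the schema and
paths are placeholders): maxFilesPerTrigger and latestFirst are passed to the
file source on readStream, not to the sink.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("file-source-options").getOrCreate()
val schema = new StructType().add("id", LongType).add("name", StringType)

val input = spark.readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", "10")  // source option: cap files per micro-batch
  .option("latestFirst", "true")       // source option: process newest files first
  .load("/data/in")

input.writeStream.format("console").start()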

On Thu, Jun 15, 2017 at 10:46 AM, Mendelson, Assaf 
wrote:

> Hi,
>
> I have started to play around with structured streaming and it seems the
> documentation (structured streaming programming guide) does not match the
> actual behavior I am seeing.
>
> It says in the documentation that maxFilesPerTrigger (as well as
> latestFirst) are options for the File sink. However, in fact, at least
> maxFilesPerTrigger does not seem to have any real effect. On the other
> hand, the streaming source (readStream) which has no documentation for this
> option, does limit the number of files.
>
> This behavior actually makes more sense than the documentation as I expect
> the file reader to define how to read files rather than the sink (e.g. if I
> would use a kafka sink or foreach sink, they should still get the same
> behavior from the reading).
>
>
>
> Thanks,
>
>   Assaf.
>
>
>


Re: Can I use ChannelTrafficShapingHandler to control the network read/write speed in shuffle?

2017-06-13 Thread Shixiong(Ryan) Zhu
I took a look at ChannelTrafficShapingHandler. It looks like it's because it
doesn't support FileRegion, which Spark's messages use. See
org.apache.spark.network.protocol.MessageWithHeader.

On Tue, Jun 13, 2017 at 4:17 AM, Niu Zhaojie  wrote:

> Hi All:
>
> I am trying to control the network read/write speed with
> ChannelTrafficShapingHandler provided by Netty.
>
>
> In TransportContext.java
>
> I modify it as below:
>
> public TransportChannelHandler initializePipeline(
> SocketChannel channel,
> RpcHandler channelRpcHandler) {
>   try {
> // added by zhaojie
> logger.info("want to try control read bandwidth on host: " + host);
> final ChannelTrafficShapingHandler channelShaping = new 
> ChannelTrafficShapingHandler(50, 50, 1000);
>
> TransportChannelHandler channelHandler = createChannelHandler(channel, 
> channelRpcHandler);
>
> channel.pipeline()
> .addLast("encoder", ENCODER)
> .addLast(TransportFrameDecoder.HANDLER_NAME, 
> NettyUtils.createFrameDecoder())
> .addLast("decoder", DECODER)
> .addLast("channelTrafficShaping", channelShaping)
> .addLast("idleStateHandler", new IdleStateHandler(0, 0, 
> conf.connectionTimeoutMs() / 1000))
> // NOTE: Chunks are currently guaranteed to be returned in the 
> order of request, but this
> // would require more logic to guarantee if this were not part of 
> the same event loop.
> .addLast("handler", channelHandler);
>
>
> I create a ChannelTrafficShapingHandler and register it into the pipeline
> of the channel. I set the write and read speed as 50kb/sec in the
> constructor.
> Except for it, what else do I need to do?
>
> However, it does not work. Is this idea correct? Am I missing something?
> Is there any better way ?
>
> Thanks.
>
> --
> *Regards,*
> *Zhaojie*
>
>


Re: Question about upgrading Kafka client version

2017-03-10 Thread Shixiong(Ryan) Zhu
I did some investigation yesterday and just posted my finds in the ticket.
Please read my latest comment in https://issues.apache.org/
jira/browse/SPARK-18057

On Fri, Mar 10, 2017 at 11:41 AM, Cody Koeninger  wrote:

> There are existing tickets on the issues around kafka versions, e.g.
> https://issues.apache.org/jira/browse/SPARK-18057 that haven't gotten
> any committer weigh-in on direction.
>
> On Thu, Mar 9, 2017 at 12:52 PM, Oscar Batori 
> wrote:
> > Guys,
> >
> > To change the subject from meta-voting...
> >
> > We are doing Spark Streaming against a Kafka setup, everything is pretty
> > standard, and pretty current. In particular we are using Spark 2.1, and
> > Kafka 0.10.1, with batch windows that are quite large (5-10 minutes). The
> > problem we are having is pretty well described in the following excerpt
> from
> > the Spark documentation:
> > "For possible kafkaParams, see Kafka consumer config docs. If your Spark
> > batch duration is larger than the default Kafka heartbeat session timeout
> > (30 seconds), increase heartbeat.interval.ms and session.timeout.ms
> > appropriately. For batches larger than 5 minutes, this will require
> changing
> > group.max.session.timeout.ms on the broker. Note that the example sets
> > enable.auto.commit to false, for discussion see Storing Offsets below."
> >
> > In our case "group.max.session.timeout.ms" is set to default value, and
> our
> > processing time per batch easily exceeds that value. I did some further
> > hunting around and found the following SO post:
> > "KIP-62, decouples heartbeats from calls to poll() via a background
> > heartbeat thread. This, allow for a longer processing time (ie, time
> between
> > two consecutive poll()) than heartbeat interval."
> >
> > This pretty accurately describes our scenario: effectively our per batch
> > processing time is 2-6 minutes, well within the batch window, but in
> excess
> > of the max session timeout between polls, causing the consumer to be
> kicked
> > out of the group.
> >
> > Are there any plans to move the Kafka client up to 0.10.1 and make this
> > feature available to consumers? Or have I missed some helpful
> configuration
> > that would ameliorate this problem? I recognize changing
> > "group.max.session.timeout.ms" is one solution, though it seems doing
> > heartbeat checking outside of implicitly piggy backing on polling seems
> more
> > elegant.
> >
> > -Oscar
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: PSA: Java 8 unidoc build

2017-02-07 Thread Shixiong(Ryan) Zhu
@Sean, I'm using Java 8 but don't see these errors until I manually build
the API docs. Hence I think dropping Java 7 support may not help.

Right now we don't build docs in most of builds as building docs takes a
long time (e.g.,
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/2889/ says
it's 1.5 hours).

On Tue, Feb 7, 2017 at 4:06 AM, Sean Owen  wrote:

> I believe that if we ran the Jenkins builds with Java 8 we would catch
> these? this doesn't require dropping Java 7 support or anything.
>
> @joshrosen I know we are just now talking about modifying the Jenkins jobs
> to remove old Hadoop configs. Is it possible to change the master jobs to
> use Java 8? can't hurt really in any event.
>
> Or maybe I'm mistaken and they already run Java 8 and it doesn't catch
> this until Java 8 is the target.
>
> Yeah this is going to keep breaking as javadoc 8 is pretty strict. Thanks
> Hyukjin. It has forced us to clean up a lot of sloppy bits of doc though.
>
>
> On Tue, Feb 7, 2017 at 12:13 AM Joseph Bradley 
> wrote:
>
>> Public service announcement: Our doc build has worked with Java 8 for
>> brief time periods, but new changes keep breaking the Java 8 unidoc build.
>> Please be aware of this, and try to test doc changes with Java 8!  In
>> general, it is stricter than Java 7 for docs.
>>
>> A shout out to @HyukjinKwon and others who have made many fixes for
>> this!  See these sample PRs for some issues causing failures (especially
>> around links):
>> https://github.com/apache/spark/pull/16741
>> https://github.com/apache/spark/pull/16604
>>
>> Thanks,
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] 
>>
>


Re: Structured Streaming Source error

2017-01-31 Thread Shixiong(Ryan) Zhu
You used one Spark version to compile your code but a different, newer version
to run it. As the Source APIs are not stable, Spark doesn't guarantee binary
compatibility for them across versions.
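
One common way to keep the two in line is to compile against the exact Spark
version that will run the query and leave Spark itself out of the application
jar. A rough sketch in sbt terms (the version number is a placeholder you
would match to your cluster):

// Compile against the cluster's Spark version and mark it "provided" so the
// cluster's own jars are the ones loaded at runtime.
val sparkVersion = "2.1.0" // placeholder: set this to the cluster's version
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"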

On Tue, Jan 31, 2017 at 1:39 PM, Sam Elamin  wrote:

> Hi Folks
>
>
> I am getting a weird error when trying to write a BigQuery Structured
> Streaming source
>
>
> Error:
> java.lang.AbstractMethodError: com.samelamin.spark.bigquery.
> streaming.BigQuerySource.commit(Lorg/apache/spark/sql/
> execution/streaming/Offset;)V
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$constructNextBatch$1$$anonfun$
> apply$mcV$sp$5.apply(StreamExecution.scala:359)
> at org.apache.spark.sql.execution.streaming.
> StreamExecution$$anonfun$org$apache$spark$sql$execution$
> streaming$StreamExecution$$constructNextBatch$1$$anonfun$
> apply$mcV$sp$5.apply(StreamExecution.scala:358)
>
>
> FYI if you are interested in Spark and BigQuery feel free to give my
> connector a go! Still trying to get structured streaming to it as a source
> hence this email. but you can use it as a sink!
>
>
> Regards
> Sam
>
>
>
>


Re: welcoming Burak and Holden as committers

2017-01-24 Thread Shixiong(Ryan) Zhu
Congrats Burak & Holden!

On Tue, Jan 24, 2017 at 10:39 AM, Joseph Bradley 
wrote:

> Congratulations Burak & Holden!
>
> On Tue, Jan 24, 2017 at 10:33 AM, Dongjoon Hyun 
> wrote:
>
>> Great! Congratulations, Burak and Holden.
>>
>> Bests,
>> Dongjoon.
>>
>> On 2017-01-24 10:29 (-0800), Nicholas Chammas 
>> wrote:
>> >  
>> >
>> > Congratulations, Burak and Holden.
>> >
>> > On Tue, Jan 24, 2017 at 1:27 PM Russell Spitzer <
>> russell.spit...@gmail.com>
>> > wrote:
>> >
>> > > Great news! Congratulations!
>> > >
>> > > On Tue, Jan 24, 2017 at 10:25 AM Dean Wampler 
>> > > wrote:
>> > >
>> > > Congratulations to both of you!
>> > >
>> > > dean
>> > >
>> > > *Dean Wampler, Ph.D.*
>> > > Author: Programming Scala, 2nd Edition, Fast Data
>> > > Architectures for Streaming Applications,
>> > > Functional Programming for Java Developers, and Programming Hive
>> > > (O'Reilly)
>> > > Lightbend
>> > > @deanwampler
>> > > http://polyglotprogramming.com
>> > > https://github.com/deanwampler
>> > >
>> > > On Tue, Jan 24, 2017 at 6:14 PM, Xiao Li 
>> wrote:
>> > >
>> > > Congratulations! Burak and Holden!
>> > >
>> > > 2017-01-24 10:13 GMT-08:00 Reynold Xin :
>> > >
>> > > Hi all,
>> > >
>> > > Burak and Holden have recently been elected as Apache Spark
>> committers.
>> > >
>> > > Burak has been very active in a large number of areas in Spark,
>> including
>> > > linear algebra, stats/maths functions in DataFrames, Python/R APIs for
>> > > DataFrames, dstream, and most recently Structured Streaming.
>> > >
>> > > Holden has been a long time Spark contributor and evangelist. She has
>> > > written a few books on Spark, as well as frequent contributions to the
>> > > Python API to improve its usability and performance.
>> > >
>> > > Please join me in welcoming the two!
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Could you post your code, please?

On Wed, Jan 11, 2017 at 3:53 PM, Kalvin Chau <kalvinnc...@gmail.com> wrote:

> "spark.speculation" is not set, so it would be whatever the default is.
>
>
> On Wed, Jan 11, 2017 at 3:43 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Or do you enable "spark.speculation"? If not, Spark Streaming should not
>> launch two tasks using the same TopicPartition.
>>
>> On Wed, Jan 11, 2017 at 3:33 PM, Kalvin Chau <kalvinnc...@gmail.com>
>> wrote:
>>
>> I have not modified that configuration setting, and that doesn't seem to
>> be documented anywhere.
>>
>> Does the Kafka 0.10 require the number of cores on an executor be set to
>> 1? I didn't see that documented anywhere either.
>>
>> On Wed, Jan 11, 2017 at 3:27 PM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>> Do you change "spark.streaming.concurrentJobs" to more than 1? Kafka
>> 0.10 connector requires it must be 1.
>>
>> On Wed, Jan 11, 2017 at 3:25 PM, Kalvin Chau <kalvinnc...@gmail.com>
>> wrote:
>>
>> I'm not re-using any InputDStreams actually, this is one InputDStream
>> that has a window applied to it.
>>  Then when Spark creates and assigns tasks to read from the Topic, one
>> executor gets assigned two tasks to read from the same TopicPartition, and
>> uses the same CachedKafkaConsumer to read from the TopicPartition causing
>> the ConcurrentModificationException in one of the worker threads.
>>
>> On Wed, Jan 11, 2017 at 2:53 PM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>> I think you may reuse the kafka DStream (the DStream returned by
>> createDirectStream). If you need to read from the same Kafka source, you
>> need to create another DStream.
>>
>> On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau <kalvinnc...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> We've been running into ConcurrentModificationExcpetions "KafkaConsumer
>> is not safe for multi-threaded access" with the CachedKafkaConsumer. I've
>> been working through debugging this issue and after looking through some of
>> the spark source code I think this is a bug.
>>
>> Our set up is:
>> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using
>> Spark-Streaming-Kafka-010
>> spark.executor.cores 1
>> spark.mesos.extra.cores 1
>>
>> Batch interval: 10s, window interval: 180s, and slide interval: 30s
>>
>> We would see the exception when in one executor there are two task worker
>> threads assigned the same Topic+Partition, but a different set of offsets.
>>
>> They would both get the same CachedKafkaConsumer, and whichever task
>> thread went first would seek and poll for all the records, and at the same
>> time the second thread would try to seek to its offset but fail because it
>> is unable to acquire the lock.
>>
>> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
>> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
>>
>> Time1 E0 Task0 - Seeks and starts to poll
>> Time1 E0 Task1 - Attempts to seek, but fails
>>
>> Here are some relevant logs:
>> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing
>> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing
>> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
>> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG
>> CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset
>> 4394204414 requested 4394204414
>> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG
>> CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset
>> 4394204414 requested 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer:
>> Initial fetch for spark-executor-consumer test-topic 2 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG
>> CachedKafkaConsumer: Seeking to test-topic-2 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager:
>> Putting block rdd_199_2 failed due to an exception
>> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block
>> rdd_199_2 could not be removed as it was not found on disk or in memory
>> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception
>> in task 49.0 in stage 45.0 (TID 3201)
>> java.util.ConcurrentModificationException: KafkaConsumer 

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Or do you enable "spark.speculation"? If not, Spark Streaming should not
launch two tasks using the same TopicPartition.

On Wed, Jan 11, 2017 at 3:33 PM, Kalvin Chau <kalvinnc...@gmail.com> wrote:

> I have not modified that configuration setting, and that doesn't seem to
> be documented anywhere.
>
> Does the Kafka 0.10 require the number of cores on an executor be set to
> 1? I didn't see that documented anywhere either.
>
> On Wed, Jan 11, 2017 at 3:27 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Do you change "spark.streaming.concurrentJobs" to more than 1? Kafka
>> 0.10 connector requires it must be 1.
>>
>> On Wed, Jan 11, 2017 at 3:25 PM, Kalvin Chau <kalvinnc...@gmail.com>
>> wrote:
>>
>> I'm not re-using any InputDStreams actually, this is one InputDStream
>> that has a window applied to it.
>>  Then when Spark creates and assigns tasks to read from the Topic, one
>> executor gets assigned two tasks to read from the same TopicPartition, and
>> uses the same CachedKafkaConsumer to read from the TopicPartition causing
>> the ConcurrentModificationException in one of the worker threads.
>>
>> On Wed, Jan 11, 2017 at 2:53 PM Shixiong(Ryan) Zhu <
>> shixi...@databricks.com> wrote:
>>
>> I think you may reuse the kafka DStream (the DStream returned by
>> createDirectStream). If you need to read from the same Kafka source, you
>> need to create another DStream.
>>
>> On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau <kalvinnc...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> We've been running into ConcurrentModificationExcpetions "KafkaConsumer
>> is not safe for multi-threaded access" with the CachedKafkaConsumer. I've
>> been working through debugging this issue and after looking through some of
>> the spark source code I think this is a bug.
>>
>> Our set up is:
>> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using
>> Spark-Streaming-Kafka-010
>> spark.executor.cores 1
>> spark.mesos.extra.cores 1
>>
>> Batch interval: 10s, window interval: 180s, and slide interval: 30s
>>
>> We would see the exception when in one executor there are two task worker
>> threads assigned the same Topic+Partition, but a different set of offsets.
>>
>> They would both get the same CachedKafkaConsumer, and whichever task
>> thread went first would seek and poll for all the records, and at the same
>> time the second thread would try to seek to its offset but fail because it
>> is unable to acquire the lock.
>>
>> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
>> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
>>
>> Time1 E0 Task0 - Seeks and starts to poll
>> Time1 E0 Task1 - Attempts to seek, but fails
>>
>> Here are some relevant logs:
>> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing
>> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing
>> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
>> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG
>> CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset
>> 4394204414 requested 4394204414
>> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG
>> CachedKafkaConsumer: Get spark-executor-consumer test-topic 2 nextOffset
>> 4394204414 requested 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer:
>> Initial fetch for spark-executor-consumer test-topic 2 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG
>> CachedKafkaConsumer: Seeking to test-topic-2 4394238058
>> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager:
>> Putting block rdd_199_2 failed due to an exception
>> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block
>> rdd_199_2 could not be removed as it was not found on disk or in memory
>> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception
>> in task 49.0 in stage 45.0 (TID 3201)
>> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
>> multi-threaded access
>> at org.apache.kafka.clients.consumer.KafkaConsumer.
>> acquire(KafkaConsumer.java:1431)
>> at org.apache.kafka.clients.consumer.KafkaConsumer.seek(
>> KafkaConsumer.java:1132)
>> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.
>> seek(CachedKafkaConsumer.scala:95)
>> at org.apache.spark.

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
Did you change "spark.streaming.concurrentJobs" to more than 1? The Kafka 0.10
connector requires it to be 1.
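
For reference, a sketch of pinning that setting explicitly (1 is already the
default, so this only documents the requirement):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-windowed-stream") // placeholder app name
  // The Kafka 0.10 direct stream expects batches to run one at a time.
  .set("spark.streaming.concurrentJobs", "1")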

On Wed, Jan 11, 2017 at 3:25 PM, Kalvin Chau <kalvinnc...@gmail.com> wrote:

> I'm not re-using any InputDStreams actually, this is one InputDStream that
> has a window applied to it.
>  Then when Spark creates and assigns tasks to read from the Topic, one
> executor gets assigned two tasks to read from the same TopicPartition, and
> uses the same CachedKafkaConsumer to read from the TopicPartition causing
> the ConcurrentModificationException in one of the worker threads.
>
> On Wed, Jan 11, 2017 at 2:53 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
> I think you may reuse the kafka DStream (the DStream returned by
> createDirectStream). If you need to read from the same Kafka source, you
> need to create another DStream.
>
> On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau <kalvinnc...@gmail.com>
> wrote:
>
> Hi,
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer
> is not safe for multi-threaded access" with the CachedKafkaConsumer. I've
> been working through debugging this issue and after looking through some of
> the spark source code I think this is a bug.
>
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
>
> We would see the exception when in one executor there are two task worker
> threads assigned the same Topic+Partition, but a different set of offsets.
>
> They would both get the same CachedKafkaConsumer, and whichever task
> thread went first would seek and poll for all the records, and at the same
> time the second thread would try to seek to its offset but fail because it
> is unable to acquire the lock.
>
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
>
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
>
> Here are some relevant logs:
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer:
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer:
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer:
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer:
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception
> in task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
> multi-threaded access
> at org.apache.kafka.clients.consumer.KafkaConsumer.
> acquire(KafkaConsumer.java:1431)
> at org.apache.kafka.clients.consumer.KafkaConsumer.seek(
> KafkaConsumer.java:1132)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.
> seek(CachedKafkaConsumer.scala:95)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.
> get(CachedKafkaConsumer.scala:69)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(
> KafkaRDD.scala:227)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(
> KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(
> MemoryStore.scala:360)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:951)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:926)
> at org.a

Re: [Streaming] ConcurrentModificationExceptions when Windowing

2017-01-11 Thread Shixiong(Ryan) Zhu
I think you may be reusing the Kafka DStream (the DStream returned by
createDirectStream). If you need to read from the same Kafka source more than
once, you need to create another DStream.
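
A rough sketch of what "create another DStream" could look like with the 0.10
direct API; ssc is assumed to be your StreamingContext, and the broker, group
ids, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "group-a",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// One direct stream per independent read of the topic, instead of sharing a
// single stream (and its cached consumers) across two consumers of the data.
val streamA = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("abc"), kafkaParams))

// A separate group.id so the second stream also sees every partition.
val streamB = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("abc"),
    kafkaParams + ("group.id" -> "group-b")))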

On Wed, Jan 11, 2017 at 2:38 PM, Kalvin Chau  wrote:

> Hi,
>
> We've been running into ConcurrentModificationExcpetions "KafkaConsumer
> is not safe for multi-threaded access" with the CachedKafkaConsumer. I've
> been working through debugging this issue and after looking through some of
> the spark source code I think this is a bug.
>
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
>
> We would see the exception when in one executor there are two task worker
> threads assigned the same Topic+Partition, but a different set of offsets.
>
> They would both get the same CachedKafkaConsumer, and whichever task
> thread went first would seek and poll for all the records, and at the same
> time the second thread would try to seek to its offset but fail because it
> is unable to acquire the lock.
>
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
>
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
>
> Here are some relevant logs:
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer:
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer:
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer:
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer:
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception
> in task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
> multi-threaded access
> at org.apache.kafka.clients.consumer.KafkaConsumer.
> acquire(KafkaConsumer.java:1431)
> at org.apache.kafka.clients.consumer.KafkaConsumer.seek(
> KafkaConsumer.java:1132)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.
> seek(CachedKafkaConsumer.scala:95)
> at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.
> get(CachedKafkaConsumer.scala:69)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(
> KafkaRDD.scala:227)
> at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(
> KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(
> MemoryStore.scala:360)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:951)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:926)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
> at org.apache.spark.storage.BlockManager.doPutIterator(
> BlockManager.scala:926)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(
> BlockManager.scala:670)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:79)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at 

Re: Spark structured steaming from kafka - last message processed again after resume from checkpoint

2016-12-25 Thread Shixiong(Ryan) Zhu
Hi Niek,

That's expected. Just answered on stackoverflow.

On Sun, Dec 25, 2016 at 8:07 AM, Niek  wrote:

> Hi,
>
> I described my issue in full detail on
> http://stackoverflow.com/questions/41300223/spark-
> structured-steaming-from-kafka-last-message-processed-again-after-resume
>
> Any idea what's going wrong?
>
> Looking at the code base on
> https://github.com/apache/spark/blob/3f62e1b5d9e75dc07bac3aa4db3e8d
> 0615cc3cc3/sql/core/src/main/scala/org/apache/spark/sql/
> execution/streaming/StreamExecution.scala#L290,
> I don't understand why you are resuming with an already committed offset
> (the one from currrentBatchId - 1)
>
> Thanks,
>
> Niek.
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-structured-
> steaming-from-kafka-last-message-processed-again-after-
> resume-from-checkpoint-tp20353.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Kafka Spark structured streaming latency benchmark.

2016-12-19 Thread Shixiong(Ryan) Zhu
Hey Prashant. Thanks for your code. I did some investigation, and it turned
out that the ContextCleaner is too slow and its "referenceQueue" keeps growing.
My hunch is that cleaning broadcasts is slow because it's a blocking call.
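
For anyone hitting a similar pile-up: one knob that may be relevant (an
assumption on my part, not something verified against this benchmark) is
making the cleaner non-blocking so a slow broadcast cleanup does not back up
the reference queue:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-latency-benchmark") // placeholder
  .set("spark.cleaner.referenceTracking.blocking", "false")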

On Mon, Dec 19, 2016 at 12:50 PM, Shixiong(Ryan) Zhu <
shixi...@databricks.com> wrote:

> Hey, Prashant. Could you track the GC root of byte arrays in the heap?
>
> On Sat, Dec 17, 2016 at 10:04 PM, Prashant Sharma <scrapco...@gmail.com>
> wrote:
>
>> Furthermore, I ran the same thing with 26 GB as the memory, which would
>> mean 1.3GB per thread of memory. My jmap
>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jmap-histo>
>> results and jstat
>> <https://github.com/ScrapCodes/KafkaProducer/blob/master/data/26GB/t11_jstat>
>> results collected after running the job for more than 11h, again show a
>> memory constraint. The same gradual slowdown, but a bit more gradual as
>> memory is considerably more than the previous runs.
>>
>>
>>
>>
>> This situation sounds like a memory leak ? As the byte array objects are
>> more than 13GB, and are not garbage collected.
>>
>> --Prashant
>>
>>
>> On Sun, Dec 18, 2016 at 8:49 AM, Prashant Sharma <scrapco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Goal of my benchmark is to arrive at end to end latency lower than 100ms
>>> and sustain them over time, by consuming from a kafka topic and writing
>>> back to another kafka topic using Spark. Since the job does not do
>>> aggregation and does a constant time processing on each message, it
>>> appeared to me as an achievable target. But, then there are some surprising
>>> and interesting pattern to observe.
>>>
>>>  Basically, it has four components namely,
>>> 1) kafka
>>> 2) Long running kafka producer, rate limited to 1000 msgs/sec, with each
>>> message of about 1KB.
>>> 3) Spark  job subscribed to `test` topic and writes out to another topic
>>> `output`.
>>> 4) A Kafka consumer, reading from the `output` topic.
>>>
>>> How the latency was measured ?
>>>
>>> While sending messages from kafka producer, each message is embedded the
>>> timestamp at which it is pushed to the kafka `test` topic. Spark receives
>>> each message and writes them out to `output` topic as is. When these
>>> messages arrive at Kafka consumer, their embedded time is subtracted from
>>> the time of arrival at the consumer and a scatter plot of the same is
>>> attached.
>>>
>>> The scatter plots sample only 10 minutes of data received during initial
>>> one hour and then again 10 minutes of data received after 2 hours of run.
>>>
>>>
>>>
>>> These plots indicate a significant slowdown in latency, in the later
>>> scatter plot indicate almost all the messages were received with a delay
>>> larger than 2 seconds. However, first plot show that most messages arrived
>>> in less than 100ms latency. The two samples were taken with time difference
>>> of 2 hours approx.
>>>
>>> After running the test for 24 hours, the jstat
>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jstat_output.txt>
>>> and jmap
>>> <https://raw.githubusercontent.com/ScrapCodes/KafkaProducer/master/data/jmap_output.txt>
>>>  output
>>> for the jobs indicate possibility  of memory constrains. To be more clear,
>>> job was run with local[20] and memory of 5GB(spark.driver.memory). The job
>>> is straight forward and located here: https://github.com/ScrapCodes/
>>> KafkaProducer/blob/master/src/main/scala/com/github/scrapcod
>>> es/kafka/SparkSQLKafkaConsumer.scala .
>>>
>>>
>>> What is causing the gradual slowdown? I need help in diagnosing the
>>> problem.
>>>
>>> Thanks,
>>>
>>> --Prashant
>>>
>>>
>>
>


Re: Kafka Spark structured streaming latency benchmark.

2016-12-19 Thread Shixiong(Ryan) Zhu
Hey, Prashant. Could you track the GC root of byte arrays in the heap?

On Sat, Dec 17, 2016 at 10:04 PM, Prashant Sharma 
wrote:

> Furthermore, I ran the same thing with 26 GB as the memory, which would
> mean 1.3GB per thread of memory. My jmap
> 
> results and jstat
> 
> results collected after running the job for more than 11h, again show a
> memory constraint. The same gradual slowdown, but a bit more gradual as
> memory is considerably more than the previous runs.
>
>
>
>
> This situation sounds like a memory leak ? As the byte array objects are
> more than 13GB, and are not garbage collected.
>
> --Prashant
>
>
> On Sun, Dec 18, 2016 at 8:49 AM, Prashant Sharma 
> wrote:
>
>> Hi,
>>
>> Goal of my benchmark is to arrive at end to end latency lower than 100ms
>> and sustain them over time, by consuming from a kafka topic and writing
>> back to another kafka topic using Spark. Since the job does not do
>> aggregation and does a constant time processing on each message, it
>> appeared to me as an achievable target. But, then there are some surprising
>> and interesting pattern to observe.
>>
>>  Basically, it has four components namely,
>> 1) kafka
>> 2) Long running kafka producer, rate limited to 1000 msgs/sec, with each
>> message of about 1KB.
>> 3) Spark  job subscribed to `test` topic and writes out to another topic
>> `output`.
>> 4) A Kafka consumer, reading from the `output` topic.
>>
>> How the latency was measured ?
>>
>> While sending messages from kafka producer, each message is embedded the
>> timestamp at which it is pushed to the kafka `test` topic. Spark receives
>> each message and writes them out to `output` topic as is. When these
>> messages arrive at Kafka consumer, their embedded time is subtracted from
>> the time of arrival at the consumer and a scatter plot of the same is
>> attached.
>>
>> The scatter plots sample only 10 minutes of data received during initial
>> one hour and then again 10 minutes of data received after 2 hours of run.
>>
>>
>>
>> These plots indicate a significant slowdown in latency, in the later
>> scatter plot indicate almost all the messages were received with a delay
>> larger than 2 seconds. However, first plot show that most messages arrived
>> in less than 100ms latency. The two samples were taken with time difference
>> of 2 hours approx.
>>
>> After running the test for 24 hours, the jstat
>> 
>> and jmap
>> 
>>  output
>> for the jobs indicate possibility  of memory constrains. To be more clear,
>> job was run with local[20] and memory of 5GB(spark.driver.memory). The job
>> is straight forward and located here: https://github.com/ScrapCodes/
>> KafkaProducer/blob/master/src/main/scala/com/github/scrapcod
>> es/kafka/SparkSQLKafkaConsumer.scala .
>>
>>
>> What is causing the gradual slowdown? I need help in diagnosing the
>> problem.
>>
>> Thanks,
>>
>> --Prashant
>>
>>
>


Re: [VOTE] Apache Spark 2.1.0 (RC2)

2016-12-09 Thread Shixiong(Ryan) Zhu
Sean, "stress test for failOnDataLoss=false" is because Kafka consumer may
be thrown NPE when a topic is deleted. I added some logic to retry on such
failure, however, it may still fail when topic deletion is too frequent
(the stress test). Just reopened
https://issues.apache.org/jira/browse/SPARK-18588.

Anyway, this is just a best effort to deal with Kafka issue, and in
practice, people won't delete topic frequently, so this is not a release
blocker.

On Fri, Dec 9, 2016 at 2:55 AM, Sean Owen  wrote:

> As usual, the sigs / hashes are fine and licenses look fine.
>
> I am still seeing some test failures. A few I've seen over time and aren't
> repeatable, but a few seem persistent. ANyone else observed these? I'm on
> Ubuntu 16 / Java 8 building for -Pyarn -Phadoop-2.7 -Phive
>
> If anyone can confirm I'll investigate the cause if I can. I'd hesitate to
> support the release yet unless the build is definitely passing for others.
>
>
> udf3Test(test.org.apache.spark.sql.JavaUDFSuite)  Time elapsed: 0.281 sec
>  <<< ERROR!
> java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.
> JavaTypeInference$.inferDataType(Lcom/google/common/reflect/TypeToken;)
> Lscala/Tuple2;
> at test.org.apache.spark.sql.JavaUDFSuite.udf3Test(JavaUDFSuite.java:107)
>
>
>
> - caching on disk *** FAILED ***
>   java.util.concurrent.TimeoutException: Can't find 2 executors before
> 3 milliseconds elapsed
>   at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(
> JobProgressListener.scala:584)
>   at org.apache.spark.DistributedSuite.org$apache$spark$DistributedSuite$$
> testCaching(DistributedSuite.scala:154)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$
> anonfun$apply$1.apply$mcV$sp(DistributedSuite.scala:191)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(
> DistributedSuite.scala:191)
>   at org.apache.spark.DistributedSuite$$anonfun$32$$anonfun$apply$1.apply(
> DistributedSuite.scala:191)
>   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(
> Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   ...
>
>
> - stress test for failOnDataLoss=false *** FAILED ***
>   org.apache.spark.sql.streaming.StreamingQueryException: Query [id =
> 3b191b78-7f30-46d3-93f8-5fbeecce94a2, runId = 
> 0cab93b6-19d8-47a7-88ad-d296bea72405]
> terminated with exception: null
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$
> spark$sql$execution$streaming$StreamExecution$$runBatches(
> StreamExecution.scala:262)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(
> StreamExecution.scala:160)
>   ...
>   Cause: java.lang.NullPointerException:
>   ...
>
>
>
> On Thu, Dec 8, 2016 at 4:40 PM Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.1.0. The vote is open until Sun, December 11, 2016 at 1:00 PT and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.1.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.1.0-rc2 (080717497365b83bc202ab16812ced
>> 93eb1ea7bd)
>>
>> List of JIRA tickets resolved are:  https://issues.apache.
>> org/jira/issues/?jql=project%20%3D%20SPARK%20AND%
>> 20fixVersion%20%3D%202.1.0
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1217
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/
>>
>>
>> (Note that the docs and staging repo are still being uploaded and will be
>> available soon)
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.1.0?
>> ===
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 

Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
No. I meant only updating master. It's not worth updating a maintenance
branch unless there are critical issues.

On Mon, Dec 5, 2016 at 5:39 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> You mean just for branch-2.0, right?
> ​
>
> On Mon, Dec 5, 2016 at 8:35 PM Shixiong(Ryan) Zhu <shixi...@databricks.com>
> wrote:
>
>> Hey Nick,
>>
>> It should be safe to upgrade Netty to the latest 4.0.x version. Could you
>> submit a PR, please?
>>
>> On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> That file is in Netty 4.0.29, but I believe the PR I referenced is not.
>> It's only in Netty 4.0.37 and up.
>>
>>
>> On Mon, Dec 5, 2016 at 1:57 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> This should be in netty-all :
>>
>> $ jar tvf 
>> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
>> | grep ThreadLocalRandom
>>967 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$1.class
>>   1079 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$2.class
>>   5973 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom.class
>>
>> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> I’m looking at the list of dependencies here:
>>
>> https://github.com/apache/spark/search?l=Groff=netty;
>> type=Code=%E2%9C%93
>>
>> What’s the difference between netty and netty-all?
>>
>> The reason I ask is because I’m looking at a Netty PR
>> <https://github.com/netty/netty/pull/5345> and trying to figure out if
>> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>>
>> Nick
>> ​
>>
>>
>>
>>


Re: Difference between netty and netty-all

2016-12-05 Thread Shixiong(Ryan) Zhu
Hey Nick,

It should be safe to upgrade Netty to the latest 4.0.x version. Could you
submit a PR, please?
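
For a downstream application that needs the newer Netty before a Spark
release picks it up, one possible workaround is an sbt dependency override
(the version below is just an example of a 4.0.x release that contains the
referenced fix):

// Force a single, newer Netty 4.0.x across the application's classpath.
dependencyOverrides += "io.netty" % "netty-all" % "4.0.42.Final"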

On Mon, Dec 5, 2016 at 11:47 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> That file is in Netty 4.0.29, but I believe the PR I referenced is not.
> It's only in Netty 4.0.37 and up.
>
>
> On Mon, Dec 5, 2016 at 1:57 PM Ted Yu  wrote:
>
>> This should be in netty-all :
>>
>> $ jar tvf 
>> /home/x/.m2/repository/io/netty/netty-all/4.0.29.Final/netty-all-4.0.29.Final.jar
>> | grep ThreadLocalRandom
>>967 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$1.class
>>   1079 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom$2.class
>>   5973 Tue Jun 23 11:10:30 UTC 2015 io/netty/util/internal/
>> ThreadLocalRandom.class
>>
>> On Mon, Dec 5, 2016 at 8:53 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> I’m looking at the list of dependencies here:
>>
>> https://github.com/apache/spark/search?l=Groff=netty;
>> type=Code=%E2%9C%93
>>
>> What’s the difference between netty and netty-all?
>>
>> The reason I ask is because I’m looking at a Netty PR
>>  and trying to figure out if
>> Spark 2.0.2 is using a version of Netty that includes that PR or not.
>>
>> Nick
>> ​
>>
>>
>>


Re: Can I add a new method to RDD class?

2016-12-05 Thread Shixiong(Ryan) Zhu
RDD.sparkContext is public:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@sparkContext:org.apache.spark.SparkContext
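
So the enrichment approach from earlier in the thread still works even when
the new method needs the context. A minimal sketch (withMarker is a made-up
method name used only for illustration):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object RddExtensions {
  // Enriches RDD without patching Spark; rdd.sparkContext is public, so the
  // added method can reach the context that created the RDD.
  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    def withMarker(marker: String): RDD[(String, T)] = {
      val b = rdd.sparkContext.broadcast(marker) // example use of the context
      rdd.map(x => (b.value, x))
    }
  }
}

// Usage: import RddExtensions._ and then someRdd.withMarker("foo")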

On Mon, Dec 5, 2016 at 1:04 PM, Teng Long  wrote:

> Thank you for providing another answer, Holden.
>
> So I did what Tarun and Michal suggested, and it didn’t work out as I want
> to have a new transformation method in RDD class, and need to use that
> RDD’s spark context which is private. So I guess the only thing I can do
> now is to sbt publishLocal?
>
> On Dec 5, 2016, at 9:19 AM, Holden Karau  wrote:
>
> Doing that requires publishing a custom version of Spark, you can edit the
> version number do do a publishLocal - but maintaining that change is going
> to be difficult. The other approaches suggested are probably better, but
> also does your method need to be defined on the RDD class? Could you
> instead make a helper object or class to expose whatever functionality you
> need?
>
> On Mon, Dec 5, 2016 at 6:06 PM long  wrote:
>
>> Thank you very much! But why can’t I just add new methods in to the
>> source code of RDD?
>>
>> On Dec 5, 2016, at 3:15 AM, Michal Šenkýř [via Apache Spark Developers
>> List] <[hidden email]
>> > wrote:
>>
>> A simple Scala example of implicit classes:
>>
>> implicit class EnhancedString(str: String) {
>>   def prefix(prefix: String) = prefix + str
>> }
>>
>> println("World".prefix("Hello "))
>>
>> As Tarun said, you have to import it if it's not in the same class where
>> you use it.
>>
>> Hope this makes it clearer,
>>
>> Michal Senkyr
>>
>> On 5.12.2016 07:43, Tarun Kumar wrote:
>>
>> Not sure if that's documented in terms of Spark, but this is a fairly
>> common pattern in Scala known as the "pimp my library" pattern; you can easily
>> find many generic examples of using this pattern. If you want, I can quickly
>> cook up a short complete example with an RDD (although there is nothing really
>> more to it than my example in the earlier mail). Thanks, Tarun Kumar
>>
>> On Mon, 5 Dec 2016 at 7:15 AM, long <[hidden email]> wrote:
>>
>> So is there documentation of this I can refer to?
>>
>> On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers
>> List] <[hidden email]
>> > wrote:
>>
>> Hi Tenglong, In addition to trsell's reply, you can add any method to an
>> RDD without making changes to Spark code. This can be achieved by using an
>> implicit class in your own client code: implicit class extendRDD[T](rdd:
>> RDD[T]){ def foo() } Then you basically need to import this implicit class
>> into the scope where you want to use the new foo method. Thanks, Tarun Kumar
>>
>> On Mon, 5 Dec 2016 at 6:59 AM, <[hidden email]> wrote:
>>
>> How does your application fetch the spark dependency? Perhaps list your
>> project dependencies and check it's using your dev build.
>>
>> On Mon, 5 Dec 2016, 08:47 tenglong, <[hidden email]> wrote:
>>
>> Hi,
>>
>> Apparently, I've already tried adding a new method to RDD,
>>
>> for example,
>>
>> class RDD {
>>   def foo() // this is the one I added
>>
>>   def map()
>>
>>   def collect()
>> }
>>
>> I can build Spark successfully, but I can't compile my application code
>> which calls rdd.foo(), and the error message says
>>
>> value foo is not a member of org.apache.spark.rdd.RDD[String]
>>
>> So I am wondering if there is any mechanism prevents me from doing this or
>> something I'm doing wrong?
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Can-I-add-a-new-
>> method-to-RDD-class-tp20100.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com .
>>
>> -
>>
>> To unsubscribe e-mail: [hidden email]
>>
>>
>>

Re: [SparkStreaming] 1 SQL tab for each SparkStreaming batch in SparkUI

2016-11-22 Thread Shixiong(Ryan) Zhu
If you create a HiveContext before starting the StreamingContext, then
`SQLContext.getOrCreate` in foreachRDD will return the HiveContext you
created. You can then call asInstanceOf[HiveContext] to get it back as a
HiveContext.
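
A minimal sketch of that ordering, using the same 1.x-style APIs as the
original snippet (sc and stream are assumed to already exist in your code):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// Create the HiveContext once, before StreamingContext.start(); getOrCreate
// can then hand the same instance back inside every micro-batch.
val hiveContext = new HiveContext(sc)

stream.foreachRDD { rdd =>
  // Reuses the context created above instead of building one per batch, which
  // is what was producing a new SQL tab for every micro-batch.
  val hc = SQLContext.getOrCreate(rdd.sparkContext).asInstanceOf[HiveContext]
  // ... run Hive queries with hc ...
}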

On Tue, Nov 22, 2016 at 8:25 AM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi Koert,
> Certainly it's not a good idea, I was trying to use SQLContext.getOrCreate
> but it will return a SQLContext and not a HiveContext.
> As I'm using a checkpoint, whenever I start the context by reading the
> checkpoint it didn't create my hive context, unless I create it foreach
> microbach.
> I didn't find a way to use the same hivecontext for all batches.
> Does anybody know where can I find how to do this?
>
>
>
>
> 2016-11-22 14:17 GMT-02:00 Koert Kuipers :
>
>> you are creating a new hive context per microbatch? is that a good idea?
>>
>> On Tue, Nov 22, 2016 at 8:51 AM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>>
>>> Has anybody seen this behavior (see tha attached picture) in Spark
>>> Streaming?
>>> It started to happen here after I changed the HiveContext creation to
>>> stream.foreachRDD {
>>> rdd =>
>>> val hiveContext = new HiveContext(rdd.sparkContext)
>>> }
>>>
>>> Is this expected?
>>>
>>> Kind Regards,
>>> Dirceu
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>
>>
>


Re: Running lint-java during PR builds?

2016-11-15 Thread Shixiong(Ryan) Zhu
I remember it's because you need to run `mvn install` before running
lint-java if the maven cache is empty, and `mvn install` is pretty heavy.

On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin  wrote:

> Hey all,
>
> Is there a reason why lint-java is not run during PR builds? I see it
> seems to be maven-only, is it really expensive to run after an sbt
> build?
>
> I see a lot of PRs coming in to fix Java style issues, and those all
> seem a little unnecessary. Either we're enforcing style checks or
> we're not, and right now it seems we aren't.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Shixiong(Ryan) Zhu
+1

On Tue, Nov 8, 2016 at 5:50 AM, Ricardo Almeida <
ricardo.alme...@actnowib.com> wrote:

> +1 (non-binding)
>
> over Ubuntu 16.10, Java 8 (OpenJDK 1.8.0_111) built with Hadoop 2.7.3,
> YARN, Hive
>
>
> On 8 November 2016 at 12:38, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1
>>
>> On Tue, Nov 8, 2016 at 7:09 AM, Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.2. The vote is open until Thu, Nov 10, 2016 at 22:00 PDT and passes if
>>> a majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.2
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.2-rc3 (584354eaac02531c9584188b14336
>>> 7ba694b0c34)
>>>
>>> This release candidate resolves 84 issues:
>>> https://s.apache.org/spark-2.0.2-jira
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1214/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions from 2.0.1.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series. Bugs already
>>> present in 2.0.1, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.3, rather than 2.0.2. If a new RC
>>> (i.e. RC4) is cut, I will change the fix version of those patches to 2.0.2.
>>>
>>
>>
>


Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
This is my test pom:


<project>
  <modelVersion>4.0.0</modelVersion>
  <artifactId>foo</artifactId>
  <groupId>bar</groupId>
  <version>1.0</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>2.0.1</version>
    </dependency>
  </dependencies>
</project>

scalatest is in the compile scope:

[INFO] bar:foo:jar:1.0
[INFO] \- org.apache.spark:spark-core_2.10:jar:2.0.1:compile
[INFO]+- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:compile
[INFO]|  +- org.apache.avro:avro-ipc:jar:1.7.7:compile
[INFO]|  |  \- org.apache.avro:avro:jar:1.7.7:compile
[INFO]|  +- org.apache.avro:avro-ipc:jar:tests:1.7.7:compile
[INFO]|  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO]|  \- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO]+- com.twitter:chill_2.10:jar:0.8.0:compile
[INFO]|  \- com.esotericsoftware:kryo-shaded:jar:3.0.3:compile
[INFO]| +- com.esotericsoftware:minlog:jar:1.3.0:compile
[INFO]| \- org.objenesis:objenesis:jar:2.1:compile
[INFO]+- com.twitter:chill-java:jar:0.8.0:compile
[INFO]+- org.apache.xbean:xbean-asm5-shaded:jar:4.4:compile
[INFO]+- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]|  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]|  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO]|  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO]|  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO]|  |  +- commons-io:commons-io:jar:2.1:compile
[INFO]|  |  +- commons-lang:commons-lang:jar:2.5:compile
[INFO]|  |  +-
commons-configuration:commons-configuration:jar:1.6:compile
[INFO]|  |  |  +-
commons-collections:commons-collections:jar:3.2.1:compile
[INFO]|  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO]|  |  |  |  \-
commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO]|  |  |  \-
commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO]|  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO]|  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO]|  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO]|  | \- org.tukaani:xz:jar:1.0:compile
[INFO]|  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO]|  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO]|  +-
org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO]|  |  +-
org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO]|  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO]|  |  |  |  \- com.google.inject:guice:jar:3.0:compile
[INFO]|  |  |  | +- javax.inject:javax.inject:jar:1:compile
[INFO]|  |  |  | \- aopalliance:aopalliance:jar:1.0:compile
[INFO]|  |  |  \-
org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO]|  |  \-
org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO]|  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO]|  +-
org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO]|  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO]|  +-
org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO]|  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO]+- org.apache.spark:spark-launcher_2.10:jar:2.0.1:compile
[INFO]+- org.apache.spark:spark-network-common_2.10:jar:2.0.1:compile
[INFO]|  +- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
[INFO]|  \-
com.fasterxml.jackson.core:jackson-annotations:jar:2.6.5:compile
[INFO]+- org.apache.spark:spark-network-shuffle_2.10:jar:2.0.1:compile
[INFO]+- org.apache.spark:spark-unsafe_2.10:jar:2.0.1:compile
[INFO]+- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]|  +- commons-codec:commons-codec:jar:1.3:compile
[INFO]|  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO]+- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO]|  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO]|  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO]|  +- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO]|  \- com.google.guava:guava:jar:14.0.1:compile
[INFO]+- javax.servlet:javax.servlet-api:jar:3.1.0:compile
[INFO]+- org.apache.commons:commons-lang3:jar:3.3.2:compile
[INFO]+- org.apache.commons:commons-math3:jar:3.4.1:compile
[INFO]+- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO]+- org.slf4j:slf4j-api:jar:1.7.16:compile
[INFO]+- org.slf4j:jul-to-slf4j:jar:1.7.16:compile
[INFO]+- org.slf4j:jcl-over-slf4j:jar:1.7.16:compile
[INFO]+- log4j:log4j:jar:1.2.17:compile
[INFO]+- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
[INFO]+- com.ning:compress-lzf:jar:1.0.3:compile
[INFO]+- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
[INFO]+- net.jpountz.lz4:lz4:jar:1.3.0:compile
[INFO]+- org.roaringbitmap:RoaringBitmap:jar:0.5.11:compile
[INFO]+- commons-net:commons-net:jar:2.2:compile
[INFO]+- 

Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
You can just exclude scalatest from Spark.
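
In sbt terms the exclusion could look like the sketch below (the Maven
analogue is an <exclusion> block on the spark-core dependency; the scalatest
version shown is just an example of "bring your own"):

// Keep Spark but drop its transitive scalatest, then depend on the version
// the application's tests actually want.
libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.0.1")
  .excludeAll(ExclusionRule(organization = "org.scalatest"))

libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.0" % "test"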

On Fri, Oct 28, 2016 at 12:51 PM, Jeremy Smith 
wrote:

> spark-core depends on spark-launcher (compile)
> spark-launcher depends on spark-tags (compile)
> spark-tags depends on scalatest (compile)
>
> To be honest I'm not all that familiar with the project structure - should
> I just exclude spark-launcher if I'm not using it?
>
> On Fri, Oct 28, 2016 at 12:27 PM, Sean Owen  wrote:
>
>> It's required because the tags module uses it to define annotations for
>> tests. I don't see it in compile scope for anything but the tags module,
>> which is then in test scope for other modules. What are you seeing that
>> makes you say it's in compile scope?
>>
>> On Fri, Oct 28, 2016 at 8:19 PM Jeremy Smith 
>> wrote:
>>
>>> Hey everybody,
>>>
>>> Just a heads up that currently Spark 2.0.1 has a compile dependency on
>>> Scalatest 2.2.6. It comes from spark-core's dependency on spark-launcher,
>>> which has a transitive dependency on spark-tags, which has a compile
>>> dependency on Scalatest.
>>>
>>> This makes it impossible to use any other version of Scalatest for
>>> testing your app if you declare a dependency on any Spark 2.0.1 module;
>>> you'll get a bunch of runtime errors during testing (unless you figure out
>>> the reason like I did and explicitly exclude Scalatest from the spark
>>> dependency).
>>>
>>> I think that dependency should probably be moved to a test dependency
>>> instead.
>>>
>>> Thanks,
>>> Jeremy
>>>
>>
>


Re: Spark has a compile dependency on scalatest

2016-10-28 Thread Shixiong(Ryan) Zhu
spark-tags is in the compile scope of spark-core...

On Fri, Oct 28, 2016 at 12:27 PM, Sean Owen  wrote:

> It's required because the tags module uses it to define annotations for
> tests. I don't see it in compile scope for anything but the tags module,
> which is then in test scope for other modules. What are you seeing that
> makes you say it's in compile scope?
>
> On Fri, Oct 28, 2016 at 8:19 PM Jeremy Smith 
> wrote:
>
>> Hey everybody,
>>
>> Just a heads up that currently Spark 2.0.1 has a compile dependency on
>> Scalatest 2.2.6. It comes from spark-core's dependency on spark-launcher,
>> which has a transitive dependency on spark-tags, which has a compile
>> dependency on Scalatest.
>>
>> This makes it impossible to use any other version of Scalatest for
>> testing your app if you declare a dependency on any Spark 2.0.1 module;
>> you'll get a bunch of runtime errors during testing (unless you figure out
>> the reason like I did and explicitly exclude Scalatest from the spark
>> dependency).
>>
>> I think that dependency should probably be moved to a test dependency
>> instead.
>>
>> Thanks,
>> Jeremy
>>
>


Re: This Exception has been really hard to trace

2016-10-10 Thread Shixiong(Ryan) Zhu
It seems the runtime Spark version is different from the one you compiled
against. You should mark the Spark components as "provided". See
https://issues.apache.org/jira/browse/SPARK-9219
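
The original build is Gradle, so the exact syntax differs (compileOnly is the
rough analogue there; that mapping is my assumption), but the idea in sbt
terms is a sketch like this:

// Compile against Spark without bundling it, so the cluster's own jars are
// used at runtime and the compiled-against and runtime versions cannot
// diverge inside the application jar.
val sparkVersion = "2.0.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"       % sparkVersion % "provided"
)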

On Sun, Oct 9, 2016 at 8:13 PM, kant kodali  wrote:

>
> I tried SpanBy but look like there is a strange error that happening no
> matter which way I try. Like the one here described for Java solution.
>
> http://qaoverflow.com/question/how-to-use-spanby-in-java/
>
>
> *java.lang.ClassCastException: cannot assign instance of
> scala.collection.immutable.List$SerializationProxy to
> fieldorg.apache.spark.rdd.RDD.org
> $apache$spark$rdd$RDD$$dependencies_
> of type scala.collection.Seq in instance of
> org.apache.spark.rdd.MapPartitionsRDD*
>
>
> JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
> javaFunctions(sc).cassandraTable("test", "hello" )
>.select("col1", "col2", "col3" )
>.spanBy(new Function<CassandraRow, ByteBuffer>() {
> @Override
> public ByteBuffer call(CassandraRow v1) {
> return v1.getBytes("rowkey");
> }
> }, ByteBuffer.class);
>
>
> And then here I do this here is where the problem occurs
>
> List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
> cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
> listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for(CassandraRow cassandraRow: tuple._2()) {
> System.out.println(cassandraRow.getLong("col1"));
> }
>
> so I tried this  and same error
>
> Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
> cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
> listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for(CassandraRow cassandraRow: tuple._2()) {
> System.out.println(cassandraRow.getLong("col1"));
> }
>
> Although I understand that ByteBuffers aren't serializable I didn't get
> any not serializable exception but still I went head and *changed
> everything to byte[] so no more ByteBuffers in the code.*
>
> I have also tried cassandraRowsRDD.collect().forEach() and
> cassandraRowsRDD.stream().forEachPartition() and the same exact error
> occurs.
>
> I am running everything locally and in a stand alone mode so my spark
> cluster is just running on localhost.
>
> Scala code runner version 2.11.8  // when I run scala -version or even
> ./spark-shell
>
>
> compile group: 'org.apache.spark' name: 'spark-core_2.11' version: '2.0.0'
> compile group: 'org.apache.spark' name: 'spark-streaming_2.11' version:
> '2.0.0'
> compile group: 'org.apache.spark' name: 'spark-sql_2.11' version: '2.0.0'
> compile group: 'com.datastax.spark' name: 'spark-cassandra-connector_2.11'
> version: '2.0.0-M3':
>
>
> So I don't see anything wrong with these versions.
>
> 2) I am bundling everything into one jar and so far it has worked out well
> except for this error.
> I am using Java 8 and Gradle.
>
>
> any ideas on how I can fix this?
>


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Shixiong(Ryan) Zhu
Congrats!

On Tue, Oct 4, 2016 at 9:09 AM, Yanbo Liang  wrote:

> Congrats and welcome!
>
> On Tue, Oct 4, 2016 at 9:01 AM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> Congratulations Xiao! Very well deserved!
>>
>> On Mon, Oct 3, 2016 at 10:46 PM, Reynold Xin  wrote:
>>
>>> Hi all,
>>>
>>> Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
>>> committer. Xiao has been a super active contributor to Spark SQL. Congrats
>>> and welcome, Xiao!
>>>
>>> - Reynold
>>>
>>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-30 Thread Shixiong(Ryan) Zhu
Hey Mark,

I can reproduce the failure locally using your command. There were a lot of
OutOfMemoryErrors in the unit test log. I increased the heap size from 3g to
4g at https://github.com/apache/spark/blob/v2.0.1-rc4/pom.xml#L2029 and the
tests passed. I think the patch you mentioned increased the memory usage of
BlockManagerSuite and made the tests prone to OOM. It can be fixed by mocking
SparkContext (or it may not be necessary, since Jenkins's Maven and sbt builds
are green now).

However, since this is only a test issue, it should not be a blocker.


On Fri, Sep 30, 2016 at 8:34 AM, Mark Hamstra 
wrote:

> 0
>
> RC4 is causing a build regression for me on at least one of my machines.
> RC3 built and ran tests successfully, but the tests consistently fail with
> RC4 unless I revert 9e91a1009e6f916245b4d4018de1664ea3decfe7,
> "[SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size
> configurable (branch 2.0)".  This is using build/mvn -U -Pyarn -Phadoop-2.7
> -Pkinesis-asl -Phive -Phive-thriftserver -Dpyspark -Dsparkr -DskipTests
> clean package; build/mvn -U -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive
> -Phive-thriftserver -Dpyspark -Dsparkr test.  Environment is macOS 10.12,
> Java 1.8.0_102.
>
> There are no tests that go red.  Rather, the core tests just end after...
>
> ...
> BlockManagerSuite:
> ...
> - overly large block
> - block compression
> - block store put failure
>
> ...with only the generic "[ERROR] Failed to execute goal
> org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
> spark-core_2.11: There are test failures".
>
> I'll try some other environments today to see whether I can turn this 0
> into either a -1 or +1, but right now that commit is looking deeply
> suspicious to me.
>
> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
>> majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.0.1
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v2.0.1-rc4 (933d2c1ea4e5f5c4ec8d375b5ccaa
>> 4577ba4be38)
>>
>> This release candidate resolves 301 issues:
>> https://s.apache.org/spark-2.0.1-jira
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1203/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/
>>
>>
>> Q: How can I help test this release?
>> A: If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions from 2.0.0.
>>
>> Q: What justifies a -1 vote for this release?
>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>> present in 2.0.0, missing features, or bugs related to new features will
>> not necessarily block this release.
>>
>> Q: What fix version should I use for patches merging into branch-2.0 from
>> now on?
>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
>> (i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
>>
>>
>>
>


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Shixiong(Ryan) Zhu
+1

On Sun, Sep 25, 2016 at 10:43 PM, Pete Lee  wrote:

> +1
>
>
> On Sun, Sep 25, 2016 at 3:26 PM, Herman van Hövell tot Westerflier <
> hvanhov...@databricks.com> wrote:
>
>> +1 (non-binding)
>>
>> On Sun, Sep 25, 2016 at 2:05 PM, Ricardo Almeida <
>> ricardo.alme...@actnowib.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Built and tested on
>>> - Ubuntu 16.04 / OpenJDK 1.8.0_91
>>> - CentOS / Oracle Java 1.7.0_55
>>> (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn)
>>>
>>>
>>> On 25 September 2016 at 22:35, Matei Zaharia 
>>> wrote:
>>>
 +1

 Matei

 On Sep 25, 2016, at 1:25 PM, Josh Rosen 
 wrote:

 +1

 On Sun, Sep 25, 2016 at 1:16 PM Yin Huai  wrote:

> +1
>
> On Sun, Sep 25, 2016 at 11:40 AM, Dongjoon Hyun 
> wrote:
>
>> +1 (non binding)
>>
>> RC3 is compiled and tested on the following two systems, too. All
>> tests passed.
>>
>> * CentOS 7.2 / Oracle JDK 1.8.0_77 / R 3.3.1
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>> -Dsparkr
>> * CentOS 7.2 / Open JDK 1.8.0_102
>>with -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver
>>
>> Cheers,
>> Dongjoon
>>
>>
>>
>> On Saturday, September 24, 2016, Reynold Xin 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.0.1. The vote is open until Tue, Sep 27, 2016 at 15:30 PDT and
>>> passes if a majority of at least 3+1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> The tag to be voted on is v2.0.1-rc3 (9d28cc10357a8afcfb2fa2e6eecb5
>>> c2cc2730d17)
>>>
>>> This release candidate resolves 290 issues:
>>> https://s.apache.org/spark-2.0.1-jira
>>>
>>> The release files, including signatures, digests, etc. can be found
>>> at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>> 1-rc3-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapache
>>> spark-1201/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.
>>> 1-rc3-docs/
>>>
>>>
>>> Q: How can I help test this release?
>>> A: If you are a Spark user, you can help us test this release by
>>> taking an existing Spark workload and running on this release candidate,
>>> then reporting any regressions from 2.0.0.
>>>
>>> Q: What justifies a -1 vote for this release?
>>> A: This is a maintenance release in the 2.0.x series.  Bugs already
>>> present in 2.0.0, missing features, or bugs related to new features will
>>> not necessarily block this release.
>>>
>>> Q: What fix version should I use for patches merging into branch-2.0
>>> from now on?
>>> A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new
>>> RC (i.e. RC4) is cut, I will change the fix version of those patches to
>>> 2.0.1.
>>>
>>>
>>>
>

>>>
>>
>


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
Hey Pete,

I just pushed your PR to branch 1.6. As it's not a blocker, it may or may
not be in 1.6.2, depending on whether there will be another RC.

On Tue, Jun 21, 2016 at 1:36 PM, Pete Robbins <robbin...@gmail.com> wrote:

> It breaks Spark running on machines with fewer than 3 cores/threads, which
> may be rare, and it is maybe an edge case.
>
> Personally, I like to fix known bugs, and the fact that there are other
> blocking methods in event loops actually makes it worse not to fix the ones
> that you know about.
>
> Probably not a blocker to release though but that's your call.
>
> Cheers,
>
> On Tue, Jun 21, 2016 at 6:40 PM Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Hey Pete,
>>
>> I didn't backport it to 1.6 because it just affects tests in most cases.
>> I'm sure we also have other places calling blocking methods in the event
>> loops, so similar issues are still there even after applying this patch.
>> Hence, I don't think it's a blocker for 1.6.2.
>>
>> On Tue, Jun 21, 2016 at 2:57 AM, Pete Robbins <robbin...@gmail.com>
>> wrote:
>>
>>> The PR (https://github.com/apache/spark/pull/13055) to fix
>>> https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2
>>> however this fix caused another issue
>>> https://issues.apache.org/jira/browse/SPARK-15606 the fix for which (
>>> https://github.com/apache/spark/pull/13355) has not been backported to
>>> the 1.6 branch so I'm now seeing the same failure in 1.6.2
>>>
>>> Cheers,
>>>
>>> On Mon, 20 Jun 2016 at 05:25 Reynold Xin <r...@databricks.com> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT
>>>> and passes if a majority of at least 3+1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.6.2
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> The tag to be voted on is v1.6.2-rc2
>>>> (54b1121f351f056d6b67d2bb4efe0d553c0f7482)
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1186/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-docs/
>>>>
>>>>
>>>> ===
>>>> == How can I help test this release? ==
>>>> ===
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running on this release candidate, then
>>>> reporting any regressions from 1.6.1.
>>>>
>>>> 
>>>> == What justifies a -1 vote for this release? ==
>>>> 
>>>> This is a maintenance release in the 1.6.x series.  Bugs already
>>>> present in 1.6.1, missing features, or bugs related to new features will
>>>> not necessarily block this release.
>>>>
>>>>
>>>>
>>>>
>>


Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
Hey Pete,

I didn't backport it to 1.6 because it just affects tests in most cases.
I'm sure we also have other places calling blocking methods in the event
loops, so similar issues are still there even after applying this patch.
Hence, I don't think it's a blocker for 1.6.2.

On Tue, Jun 21, 2016 at 2:57 AM, Pete Robbins  wrote:

> The PR (https://github.com/apache/spark/pull/13055) to fix
> https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2
> however this fix caused another issue
> https://issues.apache.org/jira/browse/SPARK-15606 the fix for which (
> https://github.com/apache/spark/pull/13355) has not been backported to
> the 1.6 branch so I'm now seeing the same failure in 1.6.2
>
> Cheers,
>
> On Mon, 20 Jun 2016 at 05:25 Reynold Xin  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.2. The vote is open until Wednesday, June 22, 2016 at 22:00 PDT and
>> passes if a majority of at least 3+1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.2
>> [ ] -1 Do not release this package because ...
>>
>>
>> The tag to be voted on is v1.6.2-rc2
>> (54b1121f351f056d6b67d2bb4efe0d553c0f7482)
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1186/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.2-rc2-docs/
>>
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions from 1.6.1.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This is a maintenance release in the 1.6.x series.  Bugs already present
>> in 1.6.1, missing features, or bugs related to new features will not
>> necessarily block this release.
>>
>>
>>
>>


Re: Welcoming Yanbo Liang as a committer

2016-06-05 Thread Shixiong(Ryan) Zhu
Congrats, Yanbo!

On Sun, Jun 5, 2016 at 6:25 PM, Liwei Lin  wrote:

> Congratulations Yanbo!
>
> On Mon, Jun 6, 2016 at 7:07 AM, Bryan Cutler  wrote:
>
>> Congratulations Yanbo!
>> On Jun 5, 2016 4:03 AM, "Kousuke Saruta" 
>> wrote:
>>
>>> Congratulations Yanbo!
>>>
>>>
>>> - Kousuke
>>>
>>> On 2016/06/04 11:48, Matei Zaharia wrote:
>>>
 Hi all,

 The PMC recently voted to add Yanbo Liang as a committer. Yanbo has
 been a super active contributor in many areas of MLlib. Please join me in
 welcoming Yanbo!

 Matei
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>


Re: LiveListenerBus with started and stopped flags? Why both?

2016-05-26 Thread Shixiong(Ryan) Zhu
Just to prevent LiveListenerBus from being restarted. The internal Thread
cannot be restarted.
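
A condensed Scala sketch of why one flag is not enough (an illustration, not
the actual LiveListenerBus code): if stop() merely flipped a single "started"
flag back to false, a later start() would call Thread.start() on a thread that
has already terminated, which throws at runtime. Tracking "stopped" separately
makes the terminal state explicit.

  import java.util.concurrent.atomic.AtomicBoolean

  class ListenerBusSketch {
    private val started = new AtomicBoolean(false)
    private val stopped = new AtomicBoolean(false)

    private val listenerThread = new Thread("listener-bus-sketch") {
      override def run(): Unit = { /* event loop elided in this sketch */ }
    }

    def start(): Unit = {
      if (stopped.get) throw new IllegalStateException("bus was stopped and cannot be restarted")
      if (!started.compareAndSet(false, true)) throw new IllegalStateException("bus already started")
      listenerThread.start()
    }

    def stop(): Unit = {
      if (!started.get) throw new IllegalStateException("stop() called before start()")
      stopped.set(true) // a real event loop would observe this flag and drain/exit
    }
  }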

On Wed, May 25, 2016 at 12:59 PM, Jacek Laskowski  wrote:

> Hi,
>
> I'm wondering why LiveListenerBus has two AtomicBoolean flags [1]?
> Could it not have just one, say started? Why does Spark have to check
> the stopped state?
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L49-L51
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: BUILD FAILURE due to...Unable to find configuration file at location dev/scalastyle-config.xml

2016-03-07 Thread Shixiong(Ryan) Zhu
There is a fix: https://github.com/apache/spark/pull/11567

On Mon, Mar 7, 2016 at 11:39 PM, Reynold Xin  wrote:

> +Sean, who was playing with this.
>
>
>
>
> On Mon, Mar 7, 2016 at 11:38 PM, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Got the BUILD FAILURE. Anyone looking into it?
>>
>> ➜  spark git:(master) ✗ ./build/mvn -Pyarn -Phadoop-2.6
>> -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver -DskipTests clean
>> install
>> ...
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time: 2.837 s
>> [INFO] Finished at: 2016-03-08T08:19:36+01:00
>> [INFO] Final Memory: 50M/581M
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.scalastyle:scalastyle-maven-plugin:0.8.0:check (default) on
>> project spark-parent_2.11: Failed during scalastyle execution: Unable
>> to find configuration file at location dev/scalastyle-config.xml ->
>> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with
>> the -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: PySpark, spill-related (possibly psutil) issue, throwing an exception '_fill_function() takes exactly 4 arguments (5 given)'

2016-03-06 Thread Shixiong(Ryan) Zhu
Could you rebuild the whole project? I changed the Python function
serialization format in https://github.com/apache/spark/pull/11535 to fix a
bug. This exception looks like some place is still using the old code.

On Sun, Mar 6, 2016 at 6:24 PM, Hyukjin Kwon  wrote:

> Just in case, My python version is 2.7.10.
>
> 2016-03-07 11:19 GMT+09:00 Hyukjin Kwon :
>
>> Hi all,
>>
>> While testing some code in PySpark, I ran into a weird issue.
>>
>> This works fine on Spark 1.6.0, but it looks like it does not on Spark 2.0.0.
>>
>> When I simply run *logData = sc.textFile(path).coalesce(1) *with some
>> big files in stand-alone local mode without HDFS, this simply throws the
>> exception,
>>
>>
>> *_fill_function() takes exactly 4 arguments (5 given)*
>>
>>
>> I initially wanted to open a JIRA for this, but it is such basic
>> functionality that I felt I might be doing something wrong.
>>
>>
>>
>> The full error message is below:
>>
>> 16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/2016011900-20160215235900-TROI_STAT_ADE_0.DAT:2415919104+33554432
>> *16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/2016011900-20160215235900-TROI_STAT_ADE_0.DAT:805306368+33554432*
>> *16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/2016011900-20160215235900-TROI_STAT_ADE_0.DAT:0+33554432*
>> *16/03/07 11:12:44 INFO rdd.HadoopRDD: Input split:
>> file:/Users/hyukjinkwon/Desktop/workspace/local/spark-local-ade/spark/data/00_REF/2016011900-20160215235900-TROI_STAT_ADE_0.DAT:1610612736+33554432*
>> *16/03/07 11:12:44 ERROR executor.Executor: Exception in task 2.0 in
>> stage 0.0 (TID 2)*
>> *org.apache.spark.api.python.PythonException: Traceback (most recent call
>> last):*
>> *  File "./python/pyspark/worker.py", line 98, in main*
>> *command = pickleSer._read_with_length(infile)*
>> *  File "./python/pyspark/serializers.py", line 164, in _read_with_length*
>> *return self.loads(obj)*
>> *  File "./python/pyspark/serializers.py", line 422, in loads*
>> *return pickle.loads(obj)*
>> *TypeError: ('_fill_function() takes exactly 4 arguments (5 given)',
>> , (> 0x10612c488>, {'defaultdict': ,
>> 'get_used_memory': , 'pack_long':
>> }, None, {}, 'pyspark.rdd'))*
>>
>> * at
>> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:168)*
>> * at
>> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:209)*
>> * at
>> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:127)*
>> * at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:62)*
>> * at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)*
>> * at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)*
>> * at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:349)*
>> * at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)*
>> * at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)*
>> * at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)*
>> * at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)*
>> * at org.apache.spark.scheduler.Task.run(Task.scala:82)*
>> * at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)*
>> * at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)*
>> * at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)*
>> * at java.lang.Thread.run(Thread.java:745)*
>> *16/03/07 11:12:44 ERROR executor.Executor: Exception in task 3.0 in
>> stage 0.0 (TID 3)*
>> *org.apache.spark.api.python.PythonException: Traceback (most recent call
>> last):*
>> *  File "./python/pyspark/worker.py", line 98, in main*
>> *command = pickleSer._read_with_length(infile)*
>> *  File "./python/pyspark/serializers.py", line 164, in _read_with_length*
>> *return self.loads(obj)*
>> *  File "./python/pyspark/serializers.py", line 422, in loads*
>> *return pickle.loads(obj)*
>> *TypeError: ('_fill_function() takes exactly 4 arguments (5 given)',
>> , (> 0x10612c488>, {'defaultdict': ,
>> 'get_used_memory': , 'pack_long':
>> }, None, {}, 'pyspark.rdd'))*
>>
>>
>> Thanks!
>>
>
>


Re: getting a list of executors for use in getPreferredLocations

2016-03-03 Thread Shixiong(Ryan) Zhu
You can take a look at
"org.apache.spark.streaming.scheduler.ReceiverTracker#getExecutors"

On Thu, Mar 3, 2016 at 3:10 PM, Reynold Xin  wrote:

> What do you mean by consistent? Throughout the life cycle of an app,
> executors can come and go, so there really is no consistency. Do you
> just need it for a specific job?
>
>
>
> On Thu, Mar 3, 2016 at 3:08 PM, Cody Koeninger  wrote:
>
>> I need getPreferredLocations to choose a consistent executor for a
>> given partition in a stream.  In order to do that, I need to know what
>> the current executors are.
>>
>> I'm currently grabbing them from the block manager master .getPeers(),
>> which works, but I don't know if that's the most reasonable way to do
>> it.
>>
>> Relevant code:
>>
>>
>> https://github.com/koeninger/spark-1/blob/aaef0fc6e7e3aae18e4e03271bc0707d09d243e4/external/kafka-beta/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L107
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Welcoming two new committers

2016-02-08 Thread Shixiong(Ryan) Zhu
Congrats!!! Herman and Wenchen!!!

On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende 
wrote:

>
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>>
>
> Congratulations !!!
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Data not getting printed in Spark Streaming with print().

2016-01-28 Thread Shixiong(Ryan) Zhu
fileStream has a parameter "newFilesOnly". By default, it's true, which means
only new files are processed and existing files in the directory are ignored.
So you need to ***move*** the files into the directory while the stream is
running; otherwise they will be ignored as existing files.

You can also set "newFilesOnly" to false. Then in the first batch, it will
process all existing files.
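
For example, a sketch of that second option, using the fileStream overload
that also takes a path filter (types and variable names mirror the code
quoted below):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // ssc and inputDirectory are the ones defined in the quoted code
  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      inputDirectory,
      (path: Path) => true,   // accept every file name
      newFilesOnly = false    // also process files already in the directory
    ).map { case (k, v) => (k.toString, v.toString) }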

On Thu, Jan 28, 2016 at 4:22 PM, satyajit vegesna <
satyajit.apas...@gmail.com> wrote:

> Hi All,
>
> I am trying to run the HdfsWordCount example from GitHub.
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
>
> I am using Ubuntu to run the program, but don't see any data getting
> printed after,
> ---
> Time: 145402680 ms
> ---
>
> I don't see any errors; the program just runs, but I do not see any output
> for the data corresponding to the file used.
>
> object HdfsStream {
>
>   def main(args:Array[String]): Unit = {
>
> val sparkConf = new SparkConf().setAppName("SpoolDirSpark").setMaster("local[5]")
> val ssc = new StreamingContext(sparkConf, Minutes(10))
>
> //val inputDirectory = "hdfs://localhost:9000/SpoolDirSpark"
> //val inputDirectory = "hdfs://localhost:9000/SpoolDirSpark/test.txt"
> val inputDirectory = "file:///home/satyajit/jsondata/"
>
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDirectory)
>   .map { case (x, y) => (x.toString, y.toString) }
> //lines.saveAsTextFiles("hdfs://localhost:9000/SpoolDirSpark/datacheck")
> lines.saveAsTextFiles("file:///home/satyajit/jsondata/")
>
> println("check_data"+lines.print())
>
> ssc.start()
> ssc.awaitTermination()
>
> Would like to know if there is any workaround, or if there is something I
> am missing.
>
> Thanking in advance,
> Satyajit.
>


Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-29 Thread Shixiong(Ryan) Zhu
Hi Jan, could you post your code? I could not reproduce this issue in my
environment.

Best Regards,
Shixiong Zhu

2015-12-29 10:22 GMT-08:00 Shixiong Zhu :

> Could you create a JIRA? We can continue the discussion there. Thanks!
>
> Best Regards,
> Shixiong Zhu
>
> 2015-12-29 3:42 GMT-08:00 Jan Uyttenhove :
>
>> Hi guys,
>>
>> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
>> mapWithState API, after previously reporting issue SPARK-11932 (
>> https://issues.apache.org/jira/browse/SPARK-11932).
>>
>> My Spark streaming job involves reading data from a Kafka topic
>> (using KafkaUtils.createDirectStream), stateful processing (using
>> checkpointing & mapWithState) & publishing the results back to Kafka.
>>
>> I'm now facing the NullPointerException below when restoring from a
>> checkpoint in the following scenario:
>> 1/ run job (with local[2]), process data from Kafka while creating &
>> keeping state
>> 2/ stop the job
>> 3/ generate some extra message on the input Kafka topic
>> 4/ start the job again (and restore offsets & state from the checkpoints)
>>
>> The problem is caused by (or at least related to) step 3, i.e. publishing
>> data to the input topic while the job is stopped.
>> The above scenario has been tested successfully when:
>> - step 3 is excluded, so restoring state from a checkpoint is successful
>> when no messages are added when the job is stopped
>> - after step 2, the checkpoints are deleted
>>
>> Any clues? Am I doing something wrong here, or is there still a problem
>> with the mapWithState impl?
>>
>> Thanx,
>>
>> Jan
>>
>>
>>
>> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
>> 3.0 (TID 24)
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory
>> on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
>> non-empty blocks out of 8 blocks
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
>> remote fetches in 0 ms
>> 15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as
>> values in memory (estimated size 1824.0 B, free 488.0 KB)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory
>> on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
>> non-empty blocks out of 8 blocks
>> 15/12/29