FYI - marking data type APIs stable

2016-10-10 Thread Reynold Xin
I noticed today that our data type APIs (org.apache.spark.sql.types) are
actually DeveloperApis, which means they can be changed from one feature
release to another. In reality these APIs have been there since the
original introduction of the DataFrame API in Spark 1.3 and have not seen
any breaking changes since then. It makes more sense to mark them stable.
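
As a point of reference, a minimal sketch of typical usage of these APIs
(the field names here are purely illustrative):

import org.apache.spark.sql.types._

// constructing an explicit schema with the org.apache.spark.sql.types API
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)))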

There are also a number of DataFrame related classes that have been
Experimental or DeveloperApi for eternity. I will be marking these stable
in the upcoming feature release (2.1).


Please shout if you disagree.


Kryo on Zeppelin

2016-10-10 Thread Fei Hu
Hi All,

I am running some Spark Scala code in Zeppelin on CDH 5.5.1 (Spark version
1.5.0). I customized the Spark interpreter to use
org.apache.spark.serializer.KryoSerializer as spark.serializer, and in the
dependencies I added Kryo 3.0.3 as follows:
 com.esotericsoftware:kryo:3.0.3


When I write the Scala notebook and run the program, I get the following
errors. But if I compile the same code into a jar and use spark-submit to
run it on the cluster, it works without errors.
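
(For reference, a minimal sketch of the equivalent setup when constructing
the SparkConf yourself rather than through the Zeppelin interpreter settings;
the registered class below is only an illustrative placeholder, not something
from my job.)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  // the same setting I put in the interpreter: use Kryo instead of Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // optionally register classes so Kryo writes compact class identifiers
  .registerKryoClasses(Array(classOf[Array[String]]))
val sc = new SparkContext(conf)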

WARN [2016-10-10 23:43:40,801] ({task-result-getter-1} Logging.scala[logWarning]:71) - Lost task 0.0 in stage 3.0 (TID 9, svr-A3-A-U20): java.io.EOFException
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:196)
        at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1175)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
        at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


There were also some errors when I ran the Zeppelin Tutorial:

Caused by: java.io.IOException: java.lang.NullPointerException
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1163)
        at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
        ... 3 more
Caused by: java.lang.NullPointerException
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:38)
        at com.twitter.chill.WrappedArraySerializer.read(WrappedArraySerializer.scala:23)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:192)
        at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1$$anonfun$apply$mcV$sp$2.apply(ParallelCollectionRDD.scala:80)
        at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1$$anonfun$apply$mcV$sp$2.apply(ParallelCollectionRDD.scala:80)
        at org.apache.spark.util.Utils$.deserializeViaNestedStream(Utils.scala:142)
        at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:80)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1160)

Does anyone know why this happened?

Thanks in advance,
Fei


Re: Quotes within a table name (phoenix table) getting failure: identifier expected at Spark level parsing

2016-10-10 Thread Xiao Li
Hi, Nico,

It sounds like you hit a bug in the Phoenix connector. I think our general
JDBC connector has already fixed it.

Thanks,

Xiao

2016-10-10 15:29 GMT-07:00 Nico Pappagianis :

> Hi Xiao, when I try that it gets past Spark's SQL parser but then errors out
> in the Phoenix SQL parser.
>
> org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00):
> Syntax error. Unexpected char: '`'
>
> at org.apache.phoenix.exception.PhoenixParserException.newException(PhoenixParserException.java:33)
>
> at org.apache.phoenix.parse.SQLParser.parseStatement(SQLParser.java:118)
>
> at org.apache.phoenix.jdbc.PhoenixStatement$PhoenixStatementParser.parseStatement(PhoenixStatement.java:1228)
>
> at org.apache.phoenix.jdbc.PhoenixStatement.parseStatement(PhoenixStatement.java:1311)
>
> at org.apache.phoenix.jdbc.PhoenixPreparedStatement.<init>(PhoenixPreparedStatement.java:94)
>
> at org.apache.phoenix.jdbc.PhoenixConnection.prepareStatement(PhoenixConnection.java:714)
>
>
> It appears that Phoenix and Spark's query parsers are in disagreement.
>
> Any ideas?
>
>
> Thanks!
>
> On Mon, Oct 10, 2016 at 3:10 PM, Xiao Li  wrote:
>
>> HI, Nico,
>>
>> We use back ticks to quote it. For example,
>>
>> CUSTOM_ENTITY.`z02`
>>
>> Thanks,
>>
>> Xiao Li
>>
>> 2016-10-10 12:49 GMT-07:00 Nico Pappagianis <
>> nico.pappagia...@salesforce.com>:
>>
>>> Hello,
>>>
>>> *Some context:*
>>> I have a Phoenix tenant-specific view named CUSTOM_ENTITY."z02" (Phoenix
>>> tables can have quotes to specify case-sensitivity). I am attempting to
>>> write to this table using Spark via a scala script. I am performing the
>>> following read successfully:
>>>
>>> val table = """CUSTOM_ENTITY."z02""""
>>> val tenantId = "myTenantId"
>>> val urlWithTenant = "jdbc:phoenix:myZKHost1, myZKHost1, myZKHost2, myZKHost3:2181;TenantId=myTenantId"
>>> val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
>>>
>>> val readOptions = Map("driver" -> driver, "url" -> urlWithTenant, "dbtable" -> table)
>>>
>>> val df = sqlContext.read.format("jdbc").options(readOptions).load
>>>
>>> This gives me the dataframe with data successfully read from my tenant
>>> view.
>>>
>>> Now when I try to write back with this dataframe:
>>>
>>> df.write.format("jdbc").insertInto(table)
>>>
>>>
>>> I am getting the following exception:
>>>
>>> java.lang.RuntimeException: [1.15] failure: identifier expected
>>>
>>> CUSTOM_ENTITY."z02"
>>>
>>>   ^
>>>
>>> (caret is pointing under the '.' before "z02")
>>>
>>> at scala.sys.package$.error(package.scala:27)
>>> at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:56)
>>> at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
>>>
>>> Looking at the stack trace it appears that Spark doesn't know what to do
>>> with the quotes around z02. I've tried escaping them in every way I could
>>> think of but to no avail.
>>>
>>> Is there a way to have Spark not complain about the quotes and correctly
>>> pass them along?
>>>
>>> Thanks
>>>
>>
>>
>


Re: Improving governance / committers (split from Spark Improvement Proposals thread)

2016-10-10 Thread Holden Karau
I think it is really important to ensure that someone with a good
understanding of Kafka is empowered with a formal voice around this
component - but I don't have much dev experience with our Kafka
connectors, so I can't speak to the specifics personally.

More generally, I also feel pretty strongly about commit bits, and while
I've been going back through the old Python JIRAs and PRs it seems we are
leaving some good stuff out just because of reviewer bandwidth (not to
mention the people that get turned away from contributing more after their
first interaction, or lack thereof). Certainly the Python reviewers know
their stuff - but it feels like for Python there just isn't enough
committer time available to handle the contributor interest. Although - to
be fair - this may be one of those cases where, as we add more committers,
we attract more contributors and still never have enough time, but I see
that as a positive cycle we should embrace.

I'm curious - are developers working mostly in other components feeling
similarly? I've sort of assumed so personally - but it would be nice to
hear others' experiences as well.

Of course my disclaimer from the original conversation applies - I do very
much "have a horse in the race", so I will avoid proposing new criteria.
Working on Spark is a core part of what I do most days, and once my day job
with Spark is done I go and do even more Spark, like working on a new Spark
book focused on performance right now - and I very much do want to see a
healthy community flourish around Spark :)

More thoughts in-line:

On Sat, Oct 8, 2016 at 5:03 PM, Cody Koeninger  wrote:

> It's not about technical design disagreement as to matters of taste,
> it's about familiarity with the domain.  To make an analogy, it's as
> if a committer in MLlib was firmly intent on, I dunno, treating a
> collection of categorical variables as if it were an ordered range of
> continuous variables.  It's just wrong.  That kind of thing, to a
> greater or lesser degree, has been going on related to the Kafka
> modules, for years.
>
> On Sat, Oct 8, 2016 at 4:11 PM, Matei Zaharia 
> wrote:
> > This makes a lot of sense; just to comment on a few things:
> >
> >> - More committers
> >> Just looking at the ratio of committers to open tickets, or committers
> >> to contributors, I don't think you have enough human power.
> >> I realize this is a touchy issue.  I don't have dog in this fight,
> >> because I'm not on either coast nor in a big company that views
> >> committership as a political thing.  I just think you need more people
> >> to do the work, and more diversity of viewpoint.
> >> It's unfortunate that the Apache governance process involves giving
> >> someone all the keys or none of the keys, but until someone really
> >> starts screwing up, I think it's better to err on the side of
> >> accepting hard-working people.
> >
> > This is something the PMC is actively discussing. Historically, we've
> added committers when people contributed a new module or feature, basically
> to the point where other developers are asking them to review changes in
> that area (https://cwiki.apache.org/confluence/display/SPARK/Committer
> s#Committers-BecomingaCommitter). For example, we added the original
> authors of GraphX when we merged in GraphX, the authors of new ML
> algorithms, etc. However, there's a good argument that some areas are
> simply not covered well now and we should add people there. Also, as the
> project has grown, there are also more people who focus on smaller fixes
> and are nonetheless contributing a lot.
>

I'm happy to hear this is something being actively discussed by the PMC.
I'm also glad the PMC took the time to create some documentation around
what it takes to be a committer - but, to me, it seems like there are some
additional requirements or nuances to the requirements/process which
haven't quite been fully captured in the current wiki, and I look forward
to seeing the result of the conversation and the clarity or changes it can
bring to the process.

I realize the default for the PMC may be to have the conversation around
this on private@ - but I think the dev (and maybe even user) community as a
whole is rather interested, and we could all benefit by working together on
this (or at least being aware of the PMC's thoughts around it). With the
decisions and discussions around the committer process happening on the
private mailing list (or in person), it's really difficult as an outsider
(or a contributor interested in becoming a committer) to feel that one has
a good understanding of what is going on. Sean Owen and Matei each provided
some insight from their points of view in Cody's initial thread

along with some additional 

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
If someone wants to tell me that it's OK and "The Apache Way" for
Kafka and Flink to have a proposal process that ends in a lazy
majority, but it's not OK for Spark to have a proposal process that
ends in a non-lazy consensus...

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#KafkaImprovementProposals-Process

In practice any PMC member can stop a proposal they don't like, so I'm
not sure how much it matters.



On Mon, Oct 10, 2016 at 5:59 PM, Mark Hamstra  wrote:
> There is a larger issue to keep in mind, and that is that what you are
> proposing is a procedure that, as far as I am aware, hasn't previously been
> adopted in an Apache project, and thus is not an easy or exact fit with
> established practices that have been blessed as "The Apache Way".  As such,
> we need to be careful, because we have run into some trouble in the past
> with some inside the ASF but essentially outside the Spark community who
> didn't like the way we were doing things.
>
> On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger  wrote:
>>
>> Apache documents say lots of confusing stuff, including that commiters are
>> in practice given a vote.
>>
>> https://www.apache.org/foundation/voting.html
>>
>> I don't care either way, if someone wants me to sub commiter for PMC in
>> the voting section, fine, we just need a clear outcome.
>>
>>
>> On Oct 10, 2016 17:36, "Mark Hamstra"  wrote:
>>>
>>> If I'm correctly understanding the kind of voting that you are talking
>>> about, then to be accurate, it is only the PMC members that have a vote, not
>>> all committers:
>>> https://www.apache.org/foundation/how-it-works.html#pmc-members
>>>
>>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger 
>>> wrote:

 I think the main value is in being honest about what's going on.  No
 one other than committers can cast a meaningful vote, that's the
 reality.  Beyond that, if people think it's more open to allow formal
 proposals from anyone, I'm not necessarily against it, but my main
 question would be this:

 If anyone can submit a proposal, are committers actually going to
 clearly reject and close proposals that don't meet the requirements?

 Right now we have a serious problem with lack of clarity regarding
 contributions, and that cannot spill over into goal-setting.

 On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue  wrote:
 > +1 to votes to approve proposals. I agree that proposals should have
 > an
 > official mechanism to be accepted, and a vote is an established means
 > of
 > doing that well. I like that it includes a period to review the
 > proposal and
 > I think proposals should have been discussed enough ahead of a vote to
 > survive the possibility of a veto.
 >
 > I also like the names that are short and (mostly) unique, like SEP.
 >
 > Where I disagree is with the requirement that a committer must
 > formally
 > propose an enhancement. I don't see the value of restricting this: if
 > someone has the will to write up a proposal then they should be
 > encouraged
 > to do so and start a discussion about it. Even if there is a political
 > reality as Cody says, what is the value of codifying that in our
 > process? I
 > think restricting who can submit proposals would only undermine them
 > by
 > pushing contributors out. Maybe I'm missing something here?
 >
 > rb
 >
 >
 >
 > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger 
 > wrote:
 >>
 >> Yes, users suggesting SIPs is a good thing and is explicitly called
 >> out in the linked document under the Who? section.  Formally
 >> proposing
 >> them, not so much, because of the political realities.
 >>
 >> Yes, implementation strategy definitely affects goals.  There are all
 >> kinds of examples of this, I'll pick one that's my fault so as to
 >> avoid sounding like I'm blaming:
 >>
 >> When I implemented the Kafka DStream, one of my (not explicitly
 >> agreed
 >> upon by the community) goals was to make sure people could use the
 >> Dstream with however they were already using Kafka at work.  The lack
 >> of explicit agreement on that goal led to all kinds of fighting with
 >> committers, that could have been avoided.  The lack of explicit
 >> up-front strategy discussion led to the DStream not really working
 >> with compacted topics.  I knew about compacted topics, but don't have
 >> a use for them, so had a blind spot there.  If there was explicit
 >> up-front discussion that my strategy was "assume that batches can be
 >> defined on the driver solely by beginning and ending offsets",
 >> there's
 >> a greater chance that a user would have seen that and said, "hey,
 >> 

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
There is a larger issue to keep in mind, and that is that what you are
proposing is a procedure that, as far as I am aware, hasn't previously been
adopted in an Apache project, and thus is not an easy or exact fit with
established practices that have been blessed as "The Apache Way".  As such,
we need to be careful, because we have run into some trouble in the past
with some inside the ASF but essentially outside the Spark community who
didn't like the way we were doing things.

On Mon, Oct 10, 2016 at 3:53 PM, Cody Koeninger  wrote:

> Apache documents say lots of confusing stuff, including that commiters are
> in practice given a vote.
>
> https://www.apache.org/foundation/voting.html
>
> I don't care either way, if someone wants me to sub commiter for PMC in
> the voting section, fine, we just need a clear outcome.
>
> On Oct 10, 2016 17:36, "Mark Hamstra"  wrote:
>
>> If I'm correctly understanding the kind of voting that you are talking
>> about, then to be accurate, it is only the PMC members that have a vote,
>> not all committers: https://www.apache.org/foundation/how-it-works.
>> html#pmc-members
>>
>> On Mon, Oct 10, 2016 at 12:02 PM, Cody Koeninger 
>> wrote:
>>
>>> I think the main value is in being honest about what's going on.  No
>>> one other than committers can cast a meaningful vote, that's the
>>> reality.  Beyond that, if people think it's more open to allow formal
>>> proposals from anyone, I'm not necessarily against it, but my main
>>> question would be this:
>>>
>>> If anyone can submit a proposal, are committers actually going to
>>> clearly reject and close proposals that don't meet the requirements?
>>>
>>> Right now we have a serious problem with lack of clarity regarding
>>> contributions, and that cannot spill over into goal-setting.
>>>
>>> On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue  wrote:
>>> > +1 to votes to approve proposals. I agree that proposals should have an
>>> > official mechanism to be accepted, and a vote is an established means
>>> of
>>> > doing that well. I like that it includes a period to review the
>>> proposal and
>>> > I think proposals should have been discussed enough ahead of a vote to
>>> > survive the possibility of a veto.
>>> >
>>> > I also like the names that are short and (mostly) unique, like SEP.
>>> >
>>> > Where I disagree is with the requirement that a committer must formally
>>> > propose an enhancement. I don't see the value of restricting this: if
>>> > someone has the will to write up a proposal then they should be
>>> encouraged
>>> > to do so and start a discussion about it. Even if there is a political
>>> > reality as Cody says, what is the value of codifying that in our
>>> process? I
>>> > think restricting who can submit proposals would only undermine them by
>>> > pushing contributors out. Maybe I'm missing something here?
>>> >
>>> > rb
>>> >
>>> >
>>> >
>>> > On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger 
>>> wrote:
>>> >>
>>> >> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> >> out in the linked document under the Who? section.  Formally proposing
>>> >> them, not so much, because of the political realities.
>>> >>
>>> >> Yes, implementation strategy definitely affects goals.  There are all
>>> >> kinds of examples of this, I'll pick one that's my fault so as to
>>> >> avoid sounding like I'm blaming:
>>> >>
>>> >> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> >> upon by the community) goals was to make sure people could use the
>>> >> Dstream with however they were already using Kafka at work.  The lack
>>> >> of explicit agreement on that goal led to all kinds of fighting with
>>> >> committers, that could have been avoided.  The lack of explicit
>>> >> up-front strategy discussion led to the DStream not really working
>>> >> with compacted topics.  I knew about compacted topics, but don't have
>>> >> a use for them, so had a blind spot there.  If there was explicit
>>> >> up-front discussion that my strategy was "assume that batches can be
>>> >> defined on the driver solely by beginning and ending offsets", there's
>>> >> a greater chance that a user would have seen that and said, "hey, what
>>> >> about non-contiguous offsets in a compacted topic".
>>> >>
>>> >> This kind of thing is only going to happen smoothly if we have a
>>> >> lightweight user-visible process with clear outcomes.
>>> >>
>>> >> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>> >>  wrote:
>>> >> > I agree with most of what Cody said.
>>> >> >
>>> >> > Two things:
>>> >> >
>>> >> > First we can always have other people suggest SIPs but mark them as
>>> >> > “unreviewed” and have committers basically move them forward. The
>>> >> > problem is
>>> >> > that writing a good document takes time. This way we can leverage
>>> non
>>> >> > committers to do some of this work 

Re: Spark Improvement Proposals

2016-10-10 Thread Mark Hamstra
I'm not a fan of the SEP acronym.  Besides its prior established meaning of
"Somebody else's problem", there are other inappropriate or offensive
connotations, such as this Australian slang that often gets shortened to
just "sep":  http://www.urbandictionary.com/define.php?term=Seppo

On Sun, Oct 9, 2016 at 4:00 PM, Nicholas Chammas  wrote:

> On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger  wrote:
>
>> Regarding name, if the SIP overlap is a concern, we can pick a different
>> name.
>>
>> My tongue in cheek suggestion would be
>>
>> Spark Lightweight Improvement process (SPARKLI)
>>
>
> If others share my minor concern about the SIP name, I propose Spark
> Enhancement Proposal (SEP), taking inspiration from the Python Enhancement
> Proposal name.
>
> So if we're going to number proposals like other projects do, they'd be
> numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.
>
> Another way to avoid a conflict is to stick with "Spark Improvement
> Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.
>
> Anyway, it's not a big deal. I just wanted to raise this point.
>
> Nick
>


Re: Quotes within a table name (phoenix table) getting failure: identifier expected at Spark level parsing

2016-10-10 Thread Xiao Li
HI, Nico,

We use back ticks to quote it. For example,

CUSTOM_ENTITY.`z02`
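
A minimal sketch of what that looks like with the rest of your snippet
unchanged (assuming the same DataFrame `df` and write options as in your
example):

// back ticks quote the case-sensitive part of the identifier for Spark's parser
val table = "CUSTOM_ENTITY.`z02`"
df.write.format("jdbc").insertInto(table)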

Thanks,

Xiao Li

2016-10-10 12:49 GMT-07:00 Nico Pappagianis :

> Hello,
>
> *Some context:*
> I have a Phoenix tenant-specific view named CUSTOM_ENTITY."z02" (Phoenix
> tables can have quotes to specify case-sensitivity). I am attempting to
> write to this table using Spark via a scala script. I am performing the
> following read successfully:
>
> val table = """CUSTOM_ENTITY."z02""""
> val tenantId = "myTenantId"
> val urlWithTenant = "jdbc:phoenix:myZKHost1, myZKHost1, myZKHost2, myZKHost3:2181;TenantId=myTenantId"
> val driver = "org.apache.phoenix.jdbc.PhoenixDriver"
>
> val readOptions = Map("driver" -> driver, "url" -> urlWithTenant, "dbtable" -> table)
>
> val df = sqlContext.read.format("jdbc").options(readOptions).load
>
> This gives me the dataframe with data successfully read from my tenant
> view.
>
> Now when I try to write back with this dataframe:
>
> df.write.format("jdbc").insertInto(table)
>
>
> I am getting the following exception:
>
> java.lang.RuntimeException: [1.15] failure: identifier expected
>
> CUSTOM_ENTITY."z02"
>
>   ^
>
> (caret is pointing under the '.' before "z02")
>
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:56)
> at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)
>
> Looking at the stack trace it appears that Spark doesn't know what to do
> with the quotes around z02. I've tried escaping them in every way I could
> think of but to no avail.
>
> Is there a way to have Spark not complain about the quotes and correctly
> pass them along?
>
> Thanks
>


Quotes within a table name (phoenix table) getting failure: identifier expected at Spark level parsing

2016-10-10 Thread Nico Pappagianis
Hello,

*Some context:*
I have a Phoenix tenant-specific view named CUSTOM_ENTITY."z02" (Phoenix
tables can have quotes to specify case-sensitivity). I am attempting to
write to this table using Spark via a scala script. I am performing the
following read successfully:

val table = """CUSTOM_ENTITY."z02""""
val tenantId = "myTenantId"
val urlWithTenant = "jdbc:phoenix:myZKHost1, myZKHost1, myZKHost2, myZKHost3:2181;TenantId=myTenantId"
val driver = "org.apache.phoenix.jdbc.PhoenixDriver"

val readOptions = Map("driver" -> driver, "url" -> urlWithTenant, "dbtable" -> table)

val df = sqlContext.read.format("jdbc").options(readOptions).load

This gives me the dataframe with data successfully read from my tenant view.

Now when I try to write back with this dataframe:

df.write.format("jdbc").insertInto(table)


I am getting the following exception:

java.lang.RuntimeException: [1.15] failure: identifier expected

CUSTOM_ENTITY."z02"

  ^

(caret is pointing under the '.' before "z02")

at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:56)
at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:164)

Looking at the stack trace it appears that Spark doesn't know what to do
with the quotes around z02. I've tried escaping them in every way I could
think of but to no avail.

Is there a way to have Spark not complain about the quotes and correctly
pass them along?

Thanks


Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
Updated on github,
https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

I believe I've touched on all feedback with the exception of naming,
and API vs Strategy.

Do we want a straw poll on naming?

Matei, are your concerns about api vs strategy addressed if we add an
API bullet point to the template?

On Mon, Oct 10, 2016 at 2:38 PM, Steve Loughran  wrote:
> This is an interesting process proposal; I think it could work well.
>
> -It's got the flavour of the ASF incubator; maybe some of the processes 
> there: mentor, regular reporting in could help, in particular, help stop the 
> -1 at the end of the work
> -it may also aid collaboration to have a medium lived branch, so enabling 
> collaboration with multiple people submitting PRs into the ASF codebase. This 
> can reduce cost of merge and enable jenkins to keep on top of it. It also 
> fits in well with the ASF "do in apache infra" community development process.
>
>
>> On 10 Oct 2016, at 20:26, Matei Zaharia  wrote:
>>
>> Agreed with this. As I said before regarding who submits: it's not a normal 
>> ASF process to require contributions to only come from committers. 
>> Committers are of course the only people who can *commit* stuff. But the 
>> whole point of an open source project is that anyone can *contribute* -- 
>> indeed, that is how people become committers. For example, in every ASF 
>> project, anyone can open JIRAs, submit design docs, submit patches, review 
>> patches, and vote on releases. This particular process is very similar to 
>> posting a JIRA or a design doc.
>>
>> I also like consensus with a deadline (e.g. someone says "here is a new SEP, 
>> we want to accept it by date X so please comment before").
>>
>> In general, with this type of stuff, it's better to start with very 
>> lightweight processes and then expand them if needed. Adding lots of rules 
>> from the beginning makes it confusing and can reduce contributions. 
>> Although, as engineers, we believe that anything can be solved using 
>> mechanical rules, in practice software development is a social process that 
>> ultimately requires humans to tackle things on a case-by-case basis.
>>
>> Matei
>>
>>
>>> On Oct 10, 2016, at 12:19 PM, Cody Koeninger  wrote:
>>>
>>> That seems reasonable to me.
>>>
>>> I do not want to see lazy consensus used on one of these proposals
>>> though, I want a clear outcome, i.e. call for a vote, wait at least 72
>>> hours, get three +1s and no vetos.
>>>
>>>
>>>
>>> On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue  wrote:
 Proposal submission: I think we should keep this as open as possible. If
 there is a problem with too many open proposals, then we should tackle that
 as a fix rather than excluding participation. Perhaps it will end up that
 way, but I think it's worth trying a more open model first.

 Majority vs consensus: My rationale is that I don't think we want to
 consider a proposal approved if it had objections serious enough that
 committers down-voted (or PMC depending on who gets a vote). If these
 proposals are like PEPs, then they represent a significant amount of
 community effort and I wouldn't want to move forward if up to half of the
 community thinks it's an untenable idea.

 rb

 On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger  
 wrote:
>
> I think this is closer to a procedural issue than a code modification
> issue, hence why majority.  If everyone thinks consensus is better, I
> don't care.  Again, I don't feel strongly about the way we achieve
> clarity, just that we achieve clarity.
>
> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue  wrote:
>> Sorry, I missed that the proposal includes majority approval. Why
>> majority
>> instead of consensus? I think we want to build consensus around these
>> proposals and it makes sense to discuss until no one would veto.
>>
>> rb
>>
>> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue  wrote:
>>>
>>> +1 to votes to approve proposals. I agree that proposals should have an
>>> official mechanism to be accepted, and a vote is an established means
>>> of
>>> doing that well. I like that it includes a period to review the
>>> proposal and
>>> I think proposals should have been discussed enough ahead of a vote to
>>> survive the possibility of a veto.
>>>
>>> I also like the names that are short and (mostly) unique, like SEP.
>>>
>>> Where I disagree is with the requirement that a committer must formally
>>> propose an enhancement. I don't see the value of restricting this: if
>>> someone has the will to write up a proposal then they should be
>>> encouraged
>>> to do so and start a discussion about it. Even if there is a political

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
That seems reasonable to me.

I do not want to see lazy consensus used on one of these proposals
though, I want a clear outcome, i.e. call for a vote, wait at least 72
hours, get three +1s and no vetos.



On Mon, Oct 10, 2016 at 2:15 PM, Ryan Blue  wrote:
> Proposal submission: I think we should keep this as open as possible. If
> there is a problem with too many open proposals, then we should tackle that
> as a fix rather than excluding participation. Perhaps it will end up that
> way, but I think it's worth trying a more open model first.
>
> Majority vs consensus: My rationale is that I don't think we want to
> consider a proposal approved if it had objections serious enough that
> committers down-voted (or PMC depending on who gets a vote). If these
> proposals are like PEPs, then they represent a significant amount of
> community effort and I wouldn't want to move forward if up to half of the
> community thinks it's an untenable idea.
>
> rb
>
> On Mon, Oct 10, 2016 at 12:07 PM, Cody Koeninger  wrote:
>>
>> I think this is closer to a procedural issue than a code modification
>> issue, hence why majority.  If everyone thinks consensus is better, I
>> don't care.  Again, I don't feel strongly about the way we achieve
>> clarity, just that we achieve clarity.
>>
>> On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue  wrote:
>> > Sorry, I missed that the proposal includes majority approval. Why
>> > majority
>> > instead of consensus? I think we want to build consensus around these
>> > proposals and it makes sense to discuss until no one would veto.
>> >
>> > rb
>> >
>> > On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue  wrote:
>> >>
>> >> +1 to votes to approve proposals. I agree that proposals should have an
>> >> official mechanism to be accepted, and a vote is an established means
>> >> of
>> >> doing that well. I like that it includes a period to review the
>> >> proposal and
>> >> I think proposals should have been discussed enough ahead of a vote to
>> >> survive the possibility of a veto.
>> >>
>> >> I also like the names that are short and (mostly) unique, like SEP.
>> >>
>> >> Where I disagree is with the requirement that a committer must formally
>> >> propose an enhancement. I don't see the value of restricting this: if
>> >> someone has the will to write up a proposal then they should be
>> >> encouraged
>> >> to do so and start a discussion about it. Even if there is a political
>> >> reality as Cody says, what is the value of codifying that in our
>> >> process? I
>> >> think restricting who can submit proposals would only undermine them by
>> >> pushing contributors out. Maybe I'm missing something here?
>> >>
>> >> rb
>> >>
>> >>
>> >>
>> >> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger 
>> >> wrote:
>> >>>
>> >>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> >>> out in the linked document under the Who? section.  Formally proposing
>> >>> them, not so much, because of the political realities.
>> >>>
>> >>> Yes, implementation strategy definitely affects goals.  There are all
>> >>> kinds of examples of this, I'll pick one that's my fault so as to
>> >>> avoid sounding like I'm blaming:
>> >>>
>> >>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> >>> upon by the community) goals was to make sure people could use the
>> >>> Dstream with however they were already using Kafka at work.  The lack
>> >>> of explicit agreement on that goal led to all kinds of fighting with
>> >>> committers, that could have been avoided.  The lack of explicit
>> >>> up-front strategy discussion led to the DStream not really working
>> >>> with compacted topics.  I knew about compacted topics, but don't have
>> >>> a use for them, so had a blind spot there.  If there was explicit
>> >>> up-front discussion that my strategy was "assume that batches can be
>> >>> defined on the driver solely by beginning and ending offsets", there's
>> >>> a greater chance that a user would have seen that and said, "hey, what
>> >>> about non-contiguous offsets in a compacted topic".
>> >>>
>> >>> This kind of thing is only going to happen smoothly if we have a
>> >>> lightweight user-visible process with clear outcomes.
>> >>>
>> >>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>> >>>  wrote:
>> >>> > I agree with most of what Cody said.
>> >>> >
>> >>> > Two things:
>> >>> >
>> >>> > First we can always have other people suggest SIPs but mark them as
>> >>> > “unreviewed” and have committers basically move them forward. The
>> >>> > problem is
>> >>> > that writing a good document takes time. This way we can leverage
>> >>> > non
>> >>> > committers to do some of this work (it is just another way to
>> >>> > contribute).
>> >>> >
>> >>> >
>> >>> >
>> >>> > As for strategy, in many cases implementation strategy can affect
>> >>> > the
>> >>> > goals.
>> >>> > I 

Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Ofir Manor
Funny, someone from my team talked to me about that idea yesterday.
We use SparkLauncher, but it just calls spark-submit, which calls other
scripts that start a new Java program that tries to submit (in our case in
cluster mode - the driver is started in the Spark cluster) and exits.
That makes it a challenge to troubleshoot cases where submit fails,
especially when users try our app in their own Spark environment. He
hoped to get a more decent / specific exception if submit failed, or to be
able to debug it in an IDE (the actual call to the master, its response,
etc.).

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Oct 10, 2016 at 9:13 PM, Russell Spitzer 
wrote:

> Just folks who don't want to use spark-submit, no real use-cases I've seen
> yet.
>
> I didn't know about SparkLauncher myself and I don't think there are any
> official docs on that or launching spark as an embedded library for tests.
>
> On Mon, Oct 10, 2016 at 11:09 AM Matei Zaharia 
> wrote:
>
>> What are the main use cases you've seen for this? Maybe we can add a page
>> to the docs about how to launch Spark as an embedded library.
>>
>> Matei
>>
>> On Oct 10, 2016, at 10:21 AM, Russell Spitzer 
>> wrote:
>>
>> I actually had not seen SparkLauncher before, that looks pretty great :)
>>
>> On Mon, Oct 10, 2016 at 10:17 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> I'm definitely only talking about non-embedded uses here as I also use
>>> embedded Spark (cassandra, and kafka) to run tests. This is almost always
>>> safe since everything is in the same JVM. It's only once we get to
>>> launching against a real distributed env do we end up with issues.
>>>
>>> Since Pyspark uses spark submit in the java gateway i'm not sure if that
>>> matters :)
>>>
>>> The cases I see are usually going through main directly, adding
>>> jars programmatically.
>>>
>>> Usually ends up with classpath errors (Spark not on the CP, their jar
>>> not on the CP, dependencies not on the cp),
>>> conf errors (executors have the incorrect environment, executor
>>> classpath broken, not understanding spark-defaults won't do anything),
>>> Jar version mismatches
>>> Etc ...
>>>
>>> On Mon, Oct 10, 2016 at 10:05 AM Sean Owen  wrote:
>>>
 I have also 'embedded' a Spark driver without much trouble. It isn't
 that it can't work.

 The Launcher API is probably the recommended way to do that though.
 spark-submit is the way to go for non-programmatic access.

 If you're not doing one of those things and it is not working, yeah I
 think people would tell you you're on your own. I think that's consistent
 with all the JIRA discussions I have seen over time.


 On Mon, Oct 10, 2016, 17:33 Russell Spitzer 
 wrote:

> I've seen a variety of users attempting to work around using Spark
> Submit with at best middling levels of success. I think it would be 
> helpful
> if the project had a clear statement that submitting an application 
> without
> using Spark Submit is truly for experts only or is unsupported entirely.
>
> I know this is a pretty strong stance and other people have had
> different experiences than me so please let me know what you think :)
>

>>


Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
I think this is closer to a procedural issue than a code modification
issue, hence why majority.  If everyone thinks consensus is better, I
don't care.  Again, I don't feel strongly about the way we achieve
clarity, just that we achieve clarity.

On Mon, Oct 10, 2016 at 2:02 PM, Ryan Blue  wrote:
> Sorry, I missed that the proposal includes majority approval. Why majority
> instead of consensus? I think we want to build consensus around these
> proposals and it makes sense to discuss until no one would veto.
>
> rb
>
> On Mon, Oct 10, 2016 at 11:54 AM, Ryan Blue  wrote:
>>
>> +1 to votes to approve proposals. I agree that proposals should have an
>> official mechanism to be accepted, and a vote is an established means of
>> doing that well. I like that it includes a period to review the proposal and
>> I think proposals should have been discussed enough ahead of a vote to
>> survive the possibility of a veto.
>>
>> I also like the names that are short and (mostly) unique, like SEP.
>>
>> Where I disagree is with the requirement that a committer must formally
>> propose an enhancement. I don't see the value of restricting this: if
>> someone has the will to write up a proposal then they should be encouraged
>> to do so and start a discussion about it. Even if there is a political
>> reality as Cody says, what is the value of codifying that in our process? I
>> think restricting who can submit proposals would only undermine them by
>> pushing contributors out. Maybe I'm missing something here?
>>
>> rb
>>
>>
>>
>> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger 
>> wrote:
>>>
>>> Yes, users suggesting SIPs is a good thing and is explicitly called
>>> out in the linked document under the Who? section.  Formally proposing
>>> them, not so much, because of the political realities.
>>>
>>> Yes, implementation strategy definitely affects goals.  There are all
>>> kinds of examples of this, I'll pick one that's my fault so as to
>>> avoid sounding like I'm blaming:
>>>
>>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>>> upon by the community) goals was to make sure people could use the
>>> Dstream with however they were already using Kafka at work.  The lack
>>> of explicit agreement on that goal led to all kinds of fighting with
>>> committers, that could have been avoided.  The lack of explicit
>>> up-front strategy discussion led to the DStream not really working
>>> with compacted topics.  I knew about compacted topics, but don't have
>>> a use for them, so had a blind spot there.  If there was explicit
>>> up-front discussion that my strategy was "assume that batches can be
>>> defined on the driver solely by beginning and ending offsets", there's
>>> a greater chance that a user would have seen that and said, "hey, what
>>> about non-contiguous offsets in a compacted topic".
>>>
>>> This kind of thing is only going to happen smoothly if we have a
>>> lightweight user-visible process with clear outcomes.
>>>
>>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>>  wrote:
>>> > I agree with most of what Cody said.
>>> >
>>> > Two things:
>>> >
>>> > First we can always have other people suggest SIPs but mark them as
>>> > “unreviewed” and have committers basically move them forward. The
>>> > problem is
>>> > that writing a good document takes time. This way we can leverage non
>>> > committers to do some of this work (it is just another way to
>>> > contribute).
>>> >
>>> >
>>> >
>>> > As for strategy, in many cases implementation strategy can affect the
>>> > goals.
>>> > I will give  a small example: In the current structured streaming
>>> > strategy,
>>> > we group by the time to achieve a sliding window. This is definitely an
>>> > implementation decision and not a goal. However, I can think of several
>>> > aggregation functions which have the time inside their calculation
>>> > buffer.
>>> > For example, let’s say we want to return a set of all distinct values.
>>> > One
>>> > way to implement this would be to make the set into a map and have the
>>> > value
>>> > contain the last time seen. Multiplying it across the groupby would
>>> > cost a
>>> > lot in performance. So adding such a strategy would have a great effect
>>> > on
>>> > the type of aggregations and their performance which does affect the
>>> > goal.
>>> > Without adding the strategy, it is easy for whoever goes to the design
>>> > document to not think about these cases. Furthermore, it might be
>>> > decided
>>> > that these cases are rare enough so that the strategy is still good
>>> > enough
>>> > but how would we know it without user feedback?
>>> >
>>> > I believe this example is exactly what Cody was talking about. Since
>>> > many
>>> > times implementation strategies have a large effect on the goal, we
>>> > should
>>> > have it discussed when discussing the goals. In addition, while it is
>>> > often
>>> > easy to throw 

Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
I think the main value is in being honest about what's going on.  No
one other than committers can cast a meaningful vote, that's the
reality.  Beyond that, if people think it's more open to allow formal
proposals from anyone, I'm not necessarily against it, but my main
question would be this:

If anyone can submit a proposal, are committers actually going to
clearly reject and close proposals that don't meet the requirements?

Right now we have a serious problem with lack of clarity regarding
contributions, and that cannot spill over into goal-setting.

On Mon, Oct 10, 2016 at 1:54 PM, Ryan Blue  wrote:
> +1 to votes to approve proposals. I agree that proposals should have an
> official mechanism to be accepted, and a vote is an established means of
> doing that well. I like that it includes a period to review the proposal and
> I think proposals should have been discussed enough ahead of a vote to
> survive the possibility of a veto.
>
> I also like the names that are short and (mostly) unique, like SEP.
>
> Where I disagree is with the requirement that a committer must formally
> propose an enhancement. I don't see the value of restricting this: if
> someone has the will to write up a proposal then they should be encouraged
> to do so and start a discussion about it. Even if there is a political
> reality as Cody says, what is the value of codifying that in our process? I
> think restricting who can submit proposals would only undermine them by
> pushing contributors out. Maybe I'm missing something here?
>
> rb
>
>
>
> On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger  wrote:
>>
>> Yes, users suggesting SIPs is a good thing and is explicitly called
>> out in the linked document under the Who? section.  Formally proposing
>> them, not so much, because of the political realities.
>>
>> Yes, implementation strategy definitely affects goals.  There are all
>> kinds of examples of this, I'll pick one that's my fault so as to
>> avoid sounding like I'm blaming:
>>
>> When I implemented the Kafka DStream, one of my (not explicitly agreed
>> upon by the community) goals was to make sure people could use the
>> Dstream with however they were already using Kafka at work.  The lack
>> of explicit agreement on that goal led to all kinds of fighting with
>> committers, that could have been avoided.  The lack of explicit
>> up-front strategy discussion led to the DStream not really working
>> with compacted topics.  I knew about compacted topics, but don't have
>> a use for them, so had a blind spot there.  If there was explicit
>> up-front discussion that my strategy was "assume that batches can be
>> defined on the driver solely by beginning and ending offsets", there's
>> a greater chance that a user would have seen that and said, "hey, what
>> about non-contiguous offsets in a compacted topic".
>>
>> This kind of thing is only going to happen smoothly if we have a
>> lightweight user-visible process with clear outcomes.
>>
>> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>>  wrote:
>> > I agree with most of what Cody said.
>> >
>> > Two things:
>> >
>> > First we can always have other people suggest SIPs but mark them as
>> > “unreviewed” and have committers basically move them forward. The
>> > problem is
>> > that writing a good document takes time. This way we can leverage non
>> > committers to do some of this work (it is just another way to
>> > contribute).
>> >
>> >
>> >
>> > As for strategy, in many cases implementation strategy can affect the
>> > goals.
>> > I will give  a small example: In the current structured streaming
>> > strategy,
>> > we group by the time to achieve a sliding window. This is definitely an
>> > implementation decision and not a goal. However, I can think of several
>> > aggregation functions which have the time inside their calculation
>> > buffer.
>> > For example, let’s say we want to return a set of all distinct values.
>> > One
>> > way to implement this would be to make the set into a map and have the
>> > value
>> > contain the last time seen. Multiplying it across the groupby would cost
>> > a
>> > lot in performance. So adding such a strategy would have a great effect
>> > on
>> > the type of aggregations and their performance which does affect the
>> > goal.
>> > Without adding the strategy, it is easy for whoever goes to the design
>> > document to not think about these cases. Furthermore, it might be
>> > decided
>> > that these cases are rare enough so that the strategy is still good
>> > enough
>> > but how would we know it without user feedback?
>> >
>> > I believe this example is exactly what Cody was talking about. Since
>> > many
>> > times implementation strategies have a large effect on the goal, we
>> > should
>> > have it discussed when discussing the goals. In addition, while it is
>> > often
>> > easy to throw out completely infeasible goals, it is often much harder
>> > to
>> > figure out 

Re: Spark Improvement Proposals

2016-10-10 Thread Ryan Blue
+1 to votes to approve proposals. I agree that proposals should have an
official mechanism to be accepted, and a vote is an established means of
doing that well. I like that it includes a period to review the proposal
and I think proposals should have been discussed enough ahead of a vote to
survive the possibility of a veto.

I also like the names that are short and (mostly) unique, like SEP.

Where I disagree is with the requirement that a committer must formally
propose an enhancement. I don't see the value of restricting this: if
someone has the will to write up a proposal then they should be encouraged
to do so and start a discussion about it. Even if there is a political
reality as Cody says, what is the value of codifying that in our process? I
think restricting who can submit proposals would only undermine them by
pushing contributors out. Maybe I'm missing something here?

rb



On Mon, Oct 10, 2016 at 7:41 AM, Cody Koeninger  wrote:

> Yes, users suggesting SIPs is a good thing and is explicitly called
> out in the linked document under the Who? section.  Formally proposing
> them, not so much, because of the political realities.
>
> Yes, implementation strategy definitely affects goals.  There are all
> kinds of examples of this, I'll pick one that's my fault so as to
> avoid sounding like I'm blaming:
>
> When I implemented the Kafka DStream, one of my (not explicitly agreed
> upon by the community) goals was to make sure people could use the
> Dstream with however they were already using Kafka at work.  The lack
> of explicit agreement on that goal led to all kinds of fighting with
> committers, that could have been avoided.  The lack of explicit
> up-front strategy discussion led to the DStream not really working
> with compacted topics.  I knew about compacted topics, but don't have
> a use for them, so had a blind spot there.  If there was explicit
> up-front discussion that my strategy was "assume that batches can be
> defined on the driver solely by beginning and ending offsets", there's
> a greater chance that a user would have seen that and said, "hey, what
> about non-contiguous offsets in a compacted topic".
>
> This kind of thing is only going to happen smoothly if we have a
> lightweight user-visible process with clear outcomes.
>
> On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
>  wrote:
> > I agree with most of what Cody said.
> >
> > Two things:
> >
> > First we can always have other people suggest SIPs but mark them as
> > “unreviewed” and have committers basically move them forward. The
> problem is
> > that writing a good document takes time. This way we can leverage non
> > committers to do some of this work (it is just another way to
> contribute).
> >
> >
> >
> > As for strategy, in many cases implementation strategy can affect the
> goals.
> > I will give  a small example: In the current structured streaming
> strategy,
> > we group by the time to achieve a sliding window. This is definitely an
> > implementation decision and not a goal. However, I can think of several
> > aggregation functions which have the time inside their calculation
> buffer.
> > For example, let’s say we want to return a set of all distinct values.
> One
> > way to implement this would be to make the set into a map and have the
> value
> > contain the last time seen. Multiplying it across the groupby would cost
> a
> > lot in performance. So adding such a strategy would have a great effect
> on
> > the type of aggregations and their performance which does affect the
> goal.
> > Without adding the strategy, it is easy for whoever goes to the design
> > document to not think about these cases. Furthermore, it might be decided
> > that these cases are rare enough so that the strategy is still good
> enough
> > but how would we know it without user feedback?
> >
> > I believe this example is exactly what Cody was talking about. Since many
> > times implementation strategies have a large effect on the goal, we
> should
> > have it discussed when discussing the goals. In addition, while it is
> often
> > easy to throw out completely infeasible goals, it is often much harder to
> > figure out that the goals are unfeasible without fine tuning.
> >
> >
> >
> >
> >
> > Assaf.
> >
> >
> >
> > From: Cody Koeninger-2 [via Apache Spark Developers List]
> > [mailto:ml-node+[hidden email]]
> > Sent: Monday, October 10, 2016 2:25 AM
> > To: Mendelson, Assaf
> > Subject: Re: Spark Improvement Proposals
> >
> >
> >
> > Only committers should formally submit SIPs because in an Apache
> > project only committers have explicit political power.  If a user can't
> > find a committer willing to sponsor an SIP idea, they have no way to
> > get the idea passed in any case.  If I can't find a committer to
> > sponsor this meta-SIP idea, I'm out of luck.
> >
> > I do not believe unrealistic goals can be found solely by inspection.
> > We've managed to ignore unrealistic goals even 

Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Russell Spitzer
I actually had not seen SparkLauncher before, that looks pretty great :)

On Mon, Oct 10, 2016 at 10:17 AM Russell Spitzer 
wrote:

> I'm definitely only talking about non-embedded uses here as I also use
> embedded Spark (cassandra, and kafka) to run tests. This is almost always
> safe since everything is in the same JVM. It's only once we get to
> launching against a real distributed env that we end up with issues.
>
> Since PySpark uses spark-submit in the Java gateway, I'm not sure if that
> matters :)
>
> The cases I see are usually going through main directly, adding
> jars programmatically.
>
> It usually ends up with classpath errors (Spark not on the CP, their jar not
> on the CP, dependencies not on the CP),
> conf errors (executors have the incorrect environment, executor classpath
> broken, not understanding spark-defaults won't do anything),
> jar version mismatches,
> etc ...
>
> On Mon, Oct 10, 2016 at 10:05 AM Sean Owen  wrote:
>
> I have also 'embedded' a Spark driver without much trouble. It isn't that
> it can't work.
>
> The Launcher API is probably the recommended way to do that though.
> spark-submit is the way to go for non-programmatic access.
>
> If you're not doing one of those things and it is not working, yeah I
> think people would tell you you're on your own. I think that's consistent
> with all the JIRA discussions I have seen over time.
>
>
> On Mon, Oct 10, 2016, 17:33 Russell Spitzer 
> wrote:
>
> I've seen a variety of users attempting to work around using Spark Submit
> with at best middling levels of success. I think it would be helpful if the
> project had a clear statement that submitting an application without using
> Spark Submit is truly for experts only or is unsupported entirely.
>
> I know this is a pretty strong stance and other people have had different
> experiences than me so please let me know what you think :)
>
>


Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Russell Spitzer
I'm definitely only talking about non-embedded uses here as I also use
embedded Spark (cassandra, and kafka) to run tests. This is almost always
safe since everything is in the same JVM. It's only once we get to
launching against a real distributed env that we end up with issues.

Since PySpark uses spark-submit in the Java gateway, I'm not sure if that
matters :)

The cases I see are usually going through main directly, adding
jars programmatically.

It usually ends up with classpath errors (Spark not on the CP, their jar not
on the CP, dependencies not on the CP),
conf errors (executors have the incorrect environment, executor classpath
broken, not understanding spark-defaults won't do anything),
jar version mismatches,
etc ...

On Mon, Oct 10, 2016 at 10:05 AM Sean Owen  wrote:

> I have also 'embedded' a Spark driver without much trouble. It isn't that
> it can't work.
>
> The Launcher API is probably the recommended way to do that though.
> spark-submit is the way to go for non-programmatic access.
>
> If you're not doing one of those things and it is not working, yeah I
> think people would tell you you're on your own. I think that's consistent
> with all the JIRA discussions I have seen over time.
>
>
> On Mon, Oct 10, 2016, 17:33 Russell Spitzer 
> wrote:
>
> I've seen a variety of users attempting to work around using Spark Submit
> with at best middling levels of success. I think it would be helpful if the
> project had a clear statement that submitting an application without using
> Spark Submit is truly for experts only or is unsupported entirely.
>
> I know this is a pretty strong stance and other people have had different
> experiences than me so please let me know what you think :)
>
>


Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Sean Owen
I have also 'embedded' a Spark driver without much trouble. It isn't that
it can't work.

The Launcher API is probably the recommended way to do that though.
spark-submit is the way to go for non-programmatic access.
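
(For illustration only -- a minimal sketch of driving an app through the
launcher instead of calling its main() directly; the jar path, class name
and master URL below are made up:

    import org.apache.spark.launcher.SparkLauncher

    val sparkProcess = new SparkLauncher()
      .setAppResource("/path/to/my-app-assembly.jar") // hypothetical app jar
      .setMainClass("com.example.MyApp")              // hypothetical main class
      .setMaster("spark://master:7077")               // hypothetical master URL
      .setConf("spark.executor.memory", "2g")
      .launch()                                       // returns a java.lang.Process
    sparkProcess.waitFor()

This way the classpath and conf setup that spark-submit would normally do
still happens for you.)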

If you're not doing one of those things and it is not working, yeah I think
people would tell you you're on your own. I think that's consistent with
all the JIRA discussions I have seen over time.

On Mon, Oct 10, 2016, 17:33 Russell Spitzer 
wrote:

> I've seen a variety of users attempting to work around using Spark Submit
> with at best middling levels of success. I think it would be helpful if the
> project had a clear statement that submitting an application without using
> Spark Submit is truly for experts only or is unsupported entirely.
>
> I know this is a pretty strong stance and other people have had different
> experiences than me so please let me know what you think :)
>


Re: Official Stance on Not Using Spark Submit

2016-10-10 Thread Marcin Tustin
I've done this for some pyspark stuff. I didn't find it especially
problematic.

On Mon, Oct 10, 2016 at 12:58 PM, Reynold Xin  wrote:

> How are they using it? Calling some main function directly?
>
>
> On Monday, October 10, 2016, Russell Spitzer 
> wrote:
>
>> I've seen a variety of users attempting to work around using Spark Submit
>> with at best middling levels of success. I think it would be helpful if the
>> project had a clear statement that submitting an application without using
>> Spark Submit is truly for experts only or is unsupported entirely.
>>
>> I know this is a pretty strong stance and other people have had different
>> experiences than me so please let me know what you think :)
>>
>




Spark 2.0.0 job completes but hangs

2016-10-10 Thread jamborta
Hi all,

I have a Spark job that takes about an hour to run. In the end it completes
all the tasks, then the job just hangs and does nothing (it writes to S3 as
the last step, which also gets completed; all files appear on S3).

Any ideas how to debug this?

see the thread dump below: 

"Attach Listener" daemon prio=10 tid=0x7f67a8001000 nid=0x7b90 waiting
on condition [0x]
   java.lang.Thread.State: RUNNABLE

"SparkUI-224" daemon prio=10 tid=0x7f6778001000 nid=0x7b4a waiting on
condition [0x7f67c2cfa000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xc77cc010> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
org.spark_project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:389)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:531)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:47)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:590)
at java.lang.Thread.run(Thread.java:745)

"SparkUI-223" daemon prio=10 tid=0x7f671008e000 nid=0x7b49 waiting on
condition [0x7f67d1b4d000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xc77cc010> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
org.spark_project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:389)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:531)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:47)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:590)
at java.lang.Thread.run(Thread.java:745)

"SparkUI-222" daemon prio=10 tid=0x7f677c006800 nid=0x7b48 waiting on
condition [0x7f67c8e33000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xc77cc010> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
org.spark_project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:389)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:531)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:47)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:590)
at java.lang.Thread.run(Thread.java:745)

"DestroyJavaVM" prio=10 tid=0x7f67ed157000 nid=0x23c9 waiting on
condition [0x]
   java.lang.Thread.State: RUNNABLE

"SparkUI-205" daemon prio=10 tid=0x7f6718002000 nid=0x7aa0 waiting on
condition [0x7f67b82ee000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xc77cc010> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
at
org.spark_project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:389)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:531)
at
org.spark_project.jetty.util.thread.QueuedThreadPool.access$700(QueuedThreadPool.java:47)
at
org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:590)
at java.lang.Thread.run(Thread.java:745)

"Scheduler-916842649" prio=10 tid=0x7f672c004000 nid=0x7a9e waiting on
condition [0x7f67b89f5000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0xc78ee520> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at

Auto start spark jobs

2016-10-10 Thread Deepak Sharma
Hi All
Is there any way to schedule an ever-running Spark job in such a way that it
comes back up on its own after cluster maintenance?


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark Improvement Proposals

2016-10-10 Thread Cody Koeninger
Yes, users suggesting SIPs is a good thing and is explicitly called
out in the linked document under the Who? section.  Formally proposing
them, not so much, because of the political realities.

Yes, implementation strategy definitely affects goals.  There are all
kinds of examples of this, I'll pick one that's my fault so as to
avoid sounding like I'm blaming:

When I implemented the Kafka DStream, one of my (not explicitly agreed
upon by the community) goals was to make sure people could use the
Dstream with however they were already using Kafka at work.  The lack
of explicit agreement on that goal led to all kinds of fighting with
committers, that could have been avoided.  The lack of explicit
up-front strategy discussion led to the DStream not really working
with compacted topics.  I knew about compacted topics, but don't have
a use for them, so had a blind spot there.  If there was explicit
up-front discussion that my strategy was "assume that batches can be
defined on the driver solely by beginning and ending offsets", there's
a greater chance that a user would have seen that and said, "hey, what
about non-contiguous offsets in a compacted topic".
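
(To make that concrete -- this is only an illustrative sketch, not the real
spark-streaming-kafka API; BatchSlice is a made-up name:

    // The strategy boils down to describing each per-partition batch as
    // nothing more than a pair of offsets, assuming every offset in between
    // still exists on the broker.
    case class BatchSlice(topic: String, partition: Int,
                          fromOffset: Long, untilOffset: Long) {
      // Fine for a normal topic; wrong for a compacted topic, where offsets
      // inside [fromOffset, untilOffset) may have been removed, so code that
      // expects exactly this many records can block or fail.
      def expectedRecordCount: Long = untilOffset - fromOffset
    }

Writing the assumption down like that is exactly the kind of thing a
compacted-topic user could have flagged up front.)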

This kind of thing is only going to happen smoothly if we have a
lightweight user-visible process with clear outcomes.

On Mon, Oct 10, 2016 at 1:34 AM, assaf.mendelson
 wrote:
> I agree with most of what Cody said.
>
> Two things:
>
> First we can always have other people suggest SIPs but mark them as
> “unreviewed” and have committers basically move them forward. The problem is
> that writing a good document takes time. This way we can leverage non
> committers to do some of this work (it is just another way to contribute).
>
>
>
> As for strategy, in many cases implementation strategy can affect the goals.
> I will give  a small example: In the current structured streaming strategy,
> we group by the time to achieve a sliding window. This is definitely an
> implementation decision and not a goal. However, I can think of several
> aggregation functions which have the time inside their calculation buffer.
> For example, let’s say we want to return a set of all distinct values. One
> way to implement this would be to make the set into a map and have the value
> contain the last time seen. Multiplying it across the groupby would cost a
> lot in performance. So adding such a strategy would have a great effect on
> the type of aggregations and their performance which does affect the goal.
> Without adding the strategy, it is easy for whoever goes to the design
> document to not think about these cases. Furthermore, it might be decided
> that these cases are rare enough so that the strategy is still good enough
> but how would we know it without user feedback?
>
> I believe this example is exactly what Cody was talking about. Since many
> times implementation strategies have a large effect on the goal, we should
> have it discussed when discussing the goals. In addition, while it is often
> easy to throw out completely infeasible goals, it is often much harder to
> figure out that the goals are unfeasible without fine tuning.
>
>
>
>
>
> Assaf.
>
>
>
> From: Cody Koeninger-2 [via Apache Spark Developers List]
> [mailto:ml-node+[hidden email]]
> Sent: Monday, October 10, 2016 2:25 AM
> To: Mendelson, Assaf
> Subject: Re: Spark Improvement Proposals
>
>
>
> Only committers should formally submit SIPs because in an Apache
> project only committers have explicit political power.  If a user can't
> find a committer willing to sponsor an SIP idea, they have no way to
> get the idea passed in any case.  If I can't find a committer to
> sponsor this meta-SIP idea, I'm out of luck.
>
> I do not believe unrealistic goals can be found solely by inspection.
> We've managed to ignore unrealistic goals even after implementation!
> Focusing on APIs can allow people to think they've solved something,
> when there's really no way of implementing that API while meeting the
> goals.  Rapid iteration is clearly the best way to address this, but
> we've already talked about why that hasn't really worked.  If adding a
> non-binding API section to the template is important to you, I'm not
> against it, but I don't think it's sufficient.
>
> On your PRD vs design doc spectrum, I'm saying this is closer to a
> PRD.  Clear agreement on goals is the most important thing and that's
> why it's the thing I want binding agreement on.  But I cannot agree to
> goals unless I have enough minimal technical info to judge whether the
> goals are likely to actually be accomplished.
>
>
>
> On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia <[hidden email]> wrote:
>
>
>> Well, I think there are a few things here that don't make sense. First,
>> why
>> should only committers submit SIPs? Development in the project should be
>> open to all contributors, whether they're committers or not. Second, I
>> think
>> unrealistic goals can be found just by inspecting the goals, and I'm not

Re: This Exception has been really hard to trace

2016-10-10 Thread kant kodali
Hi
I use Gradle and I don't think it really has "provided", but I was able to google
around and create the following file; the same error still persists.
group 'com.company'
version '1.0-SNAPSHOT'
apply plugin: 'java'
apply plugin: 'idea'
repositories {
    mavenCentral()
    mavenLocal()
}
configurations {
    provided
}
sourceSets {
    main {
        compileClasspath += configurations.provided
        test.compileClasspath += configurations.provided
        test.runtimeClasspath += configurations.provided
    }
}
idea {
    module {
        scopes.PROVIDED.plus += [ configurations.provided ]
    }
}
dependencies {
    compile 'org.slf4j:slf4j-log4j12:1.7.12'
    provided group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
    provided group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
    provided group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
    provided group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.0-M3'
}
jar {
    from { configurations.provided.collect { it.isDirectory() ? it : zipTree(it) } }   // with jar
    from sourceSets.test.output
    manifest {
        attributes 'Main-Class': "com.company.batchprocessing.Hello"
    }
    exclude 'META-INF/*.RSA', 'META-INF/*.SF', 'META-INF/*.DSA'
    zip64 true
}
This successfully creates the jar but the error still persists.
 





On Sun, Oct 9, 2016 11:44 PM, Shixiong(Ryan) Zhu shixi...@databricks.com
wrote:
Seems the runtime Spark is different from the compiled one. You should mark the
Spark components  "provided". See
https://issues.apache.org/jira/browse/SPARK-9219
On Sun, Oct 9, 2016 at 8:13 PM, kant kodali   wrote:

I tried spanBy but it looks like there is a strange error happening no matter
which way I try, like the one described here for the Java solution.

http://qaoverflow.com/question/how-to-use-spanby-in-java/

java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
        javaFunctions(sc).cassandraTable("test", "hello")
                .select("col1", "col2", "col3")
                .spanBy(new Function<CassandraRow, ByteBuffer>() {
                    @Override
                    public ByteBuffer call(CassandraRow v1) {
                        return v1.getBytes("rowkey");
                    }
                }, ByteBuffer.class);

And then here is where the problem occurs:

List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
        cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
So I tried this and got the same error:

Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
        cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
Although I understand that ByteBuffers aren't serializable, I didn't get a
NotSerializableException, but I still went ahead and changed everything to byte[] so
there are no more ByteBuffers in the code.
I have also tried cassandraRowsRDD.collect().forEach() and
cassandraRowsRDD.stream().forEachPartition() and the same exact error occurs.
I am running everything locally and in standalone mode, so my Spark cluster is
just running on localhost.
Scala code runner version 2.11.8  // when I run scala -version or even
./spark-shell

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.0-M3'

So I don't see anything wrong with these versions.
2) I am bundling everything into one jar and so far it has worked out well
except for this error.
I am using Java 8 and Gradle.

any ideas on how I can fix this?

Re: Monitoring system extensibility

2016-10-10 Thread Pete Robbins
Yes I agree. I'm not sure how important this is anyway. It's a little
annoying but easy to work around.
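
(For anyone hitting the same thing, the workaround mentioned below is roughly
this -- a sketch only, with made-up class and metric names; the only trick is
that the class has to live under an org.apache.spark package so it can see the
private[spark] Source trait:

    package org.apache.spark.metrics.source

    import com.codahale.metrics.{Gauge, MetricRegistry}

    class MyAppSource extends Source {
      override val sourceName: String = "myapp"
      override val metricRegistry: MetricRegistry = new MetricRegistry()

      // Example metric; real code would report something meaningful here.
      metricRegistry.register(MetricRegistry.name("queueDepth"), new Gauge[Int] {
        override def getValue: Int = 0
      })
    }

Wiring it into the metrics system is a separate step, which is part of why a
public API would help.)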

On Mon, 10 Oct 2016 at 09:01 Reynold Xin  wrote:

> I just took a quick look and set a target version on the JIRA. But Pete I
> think the primary problem with the JIRA and pull request is that it really
> just argues (or implements) opening up a private API, which is a valid
> point but there are a lot more that needs to be done before making some
> private API public.
>
> At the very least, we need to answer the following:
>
> 1. Is the existing API maintainable? E.g. Is it OK to just expose coda
> hale metrics in the API? Do we need to worry about dependency conflicts?
> Should we wrap it?
>
> 2. Is the existing API sufficiently general (to cover use cases)? What
> about security related setup?
>
>
>
>
> On Fri, Oct 7, 2016 at 2:03 AM, Pete Robbins  wrote:
>
> Which has happened. The last comment was in August, with someone saying
> it was important to them. The PR has been around since March and, despite a
> request to be reviewed, has not got any committer's attention. Without that,
> it is going nowhere. The historic JIRAs requesting other sinks such as
> Kafka, OpenTSDB, etc. have also been ignored.
>
> So for now we continue creating classes in o.a.s package.
>
> On Fri, 7 Oct 2016 at 09:50 Reynold Xin  wrote:
>
> So to be constructive and in order to actually open up these APIs, it
> would be useful for users to comment on the JIRA ticket on their use cases
> (rather than "I want this to be public"), and then we can design an API
> that would address those use cases. In some cases the solution is to just
> make the existing internal API public. But turning some internal API public
> without thinking about whether those APIs are sufficiently expressive and
> maintainable is not a great way to design APIs in general.
>
> On Friday, October 7, 2016, Pete Robbins  wrote:
>
> I brought this up last year and there was a Jira raised:
> https://issues.apache.org/jira/browse/SPARK-14151
>
> For now I just have my Sink and Source in an o.a.s package name, which is
> not ideal but the only way round this.
>
> On Fri, 7 Oct 2016 at 08:30 Reynold Xin  wrote:
>
> They have always been private, haven't they?
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/metrics/source/Source.scala
>
>
>
> On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov  > wrote:
>
> Hi.
>
> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
> `org.apache.spark.metrics.sink.Sink` are defined as private to the ‘spark’
> package, so it becomes troublesome to create a new implementation in the
> user’s code (but still possible in a hacky way).
> This seems to be the only missing piece to allow extension of the metrics
> system, and I wonder whether it was a conscious design decision to limit the
> visibility. Is it possible to broaden the visibility scope for these traits
> in future versions?
>
> Thanks,
> Alexander
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>


Re: Monitoring system extensibility

2016-10-10 Thread Reynold Xin
I just took a quick look and set a target version on the JIRA. But Pete I
think the primary problem with the JIRA and pull request is that it really
just argues (or implements) opening up a private API, which is a valid
point but there are a lot more that needs to be done before making some
private API public.

At the very least, we need to answer the following:

1. Is the existing API maintainable? E.g. Is it OK to just expose coda hale
metrics in the API? Do we need to worry about dependency conflicts? Should
we wrap it?

2. Is the existing API sufficiently general (to cover use cases)? What
about security related setup?




On Fri, Oct 7, 2016 at 2:03 AM, Pete Robbins  wrote:

> Which has happened. The last comment was in August, with someone saying
> it was important to them. The PR has been around since March and, despite a
> request to be reviewed, has not got any committer's attention. Without that,
> it is going nowhere. The historic JIRAs requesting other sinks such as
> Kafka, OpenTSDB, etc. have also been ignored.
>
> So for now we continue creating classes in o.a.s package.
>
> On Fri, 7 Oct 2016 at 09:50 Reynold Xin  wrote:
>
>> So to be constructive and in order to actually open up these APIs, it
>> would be useful for users to comment on the JIRA ticket on their use cases
>> (rather than "I want this to be public"), and then we can design an API
>> that would address those use cases. In some cases the solution is to just
>> make the existing internal API public. But turning some internal API public
>> without thinking about whether those APIs are sufficiently expressive and
>> maintainable is not a great way to design APIs in general.
>>
>> On Friday, October 7, 2016, Pete Robbins  wrote:
>>
>>> I brought this up last year and there was a Jira raised:
>>> https://issues.apache.org/jira/browse/SPARK-14151
>>>
>>> For now I just have my Sink and Source in an o.a.s package name, which is
>>> not ideal but the only way round this.
>>>
>>> On Fri, 7 Oct 2016 at 08:30 Reynold Xin  wrote:
>>>
 They have always been private, haven't they?

 https://github.com/apache/spark/blob/branch-1.6/core/
 src/main/scala/org/apache/spark/metrics/source/Source.scala



 On Thu, Oct 6, 2016 at 7:38 AM, Alexander Oleynikov <
 oleyniko...@gmail.com> wrote:

> Hi.
>
> As of v2.0.1, the traits `org.apache.spark.metrics.source.Source` and
> `org.apache.spark.metrics.sink.Sink` are defined as private to the
> ‘spark’ package, so it becomes troublesome to create a new implementation
> in the user’s code (but still possible in a hacky way).
> This seems to be the only missing piece to allow extension of the
> metrics system, and I wonder whether it was a conscious design decision to
> limit the visibility. Is it possible to broaden the visibility scope for
> these traits in future versions?
>
> Thanks,
> Alexander
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>



Re: This Exception has been really hard to trace

2016-10-10 Thread Shixiong(Ryan) Zhu
Seems the runtime Spark is different from the compiled one. You should
mark the Spark components  "provided". See
https://issues.apache.org/jira/browse/SPARK-9219
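
(Roughly what that looks like in a build file -- a sketch only; with Gradle
2.12+ the built-in compileOnly configuration can play the role of Maven's
"provided" scope, so Spark classes come from the cluster at runtime instead
of being bundled. Artifact names below assume a Scala 2.11 / Spark 2.0.0
build:

    dependencies {
        compileOnly group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
        compileOnly group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
        compileOnly group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
    }

The assembled jar should then not re-include those dependencies.)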

On Sun, Oct 9, 2016 at 8:13 PM, kant kodali  wrote:

>
> I tried spanBy but it looks like there is a strange error happening no
> matter which way I try, like the one described here for the Java solution.
>
> http://qaoverflow.com/question/how-to-use-spanby-in-java/
>
>
> java.lang.ClassCastException: cannot assign instance of
> scala.collection.immutable.List$SerializationProxy to field
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
> of type scala.collection.Seq in instance of
> org.apache.spark.rdd.MapPartitionsRDD
>
>
> JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
>         javaFunctions(sc).cassandraTable("test", "hello")
>                 .select("col1", "col2", "col3")
>                 .spanBy(new Function<CassandraRow, ByteBuffer>() {
>                     @Override
>                     public ByteBuffer call(CassandraRow v1) {
>                         return v1.getBytes("rowkey");
>                     }
>                 }, ByteBuffer.class);
>
>
> And then here is where the problem occurs:
>
> List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
>         cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for (CassandraRow cassandraRow : tuple._2()) {
>     System.out.println(cassandraRow.getLong("col1"));
> }
>
> So I tried this and got the same error:
>
> Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
>         cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple = listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for (CassandraRow cassandraRow : tuple._2()) {
>     System.out.println(cassandraRow.getLong("col1"));
> }
>
> Although I understand that ByteBuffers aren't serializable, I didn't get
> a NotSerializableException, but I still went ahead and changed
> everything to byte[] so there are no more ByteBuffers in the code.
>
> I have also tried cassandraRowsRDD.collect().forEach() and
> cassandraRowsRDD.stream().forEachPartition() and the same exact error
> occurs.
>
> I am running everything locally and in standalone mode, so my Spark
> cluster is just running on localhost.
>
> Scala code runner version 2.11.8  // when I run scala -version or even
> ./spark-shell
>
>
> compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
> compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
> compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
> compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.0-M3'
>
>
> So I don't see anything wrong with these versions.
>
> 2) I am bundling everything into one jar and so far it has worked out well
> except for this error.
> I am using Java 8 and Gradle.
>
>
> any ideas on how I can fix this?
>