Re: Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
Thanks.

What would be the equivalent code in Hadoop for the 110s/0.9s comparison 
that Spark published?


On 1 Apr, 2014, at 2:44 pm, DB Tsai  wrote:

> Hi Li-Ming,
> 
> This binary logistic regression using SGD is in 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
> 
> We're working on multinomial logistic regression using Newton and L-BFGS 
> optimizers now; it will be released soon.
> 
> 
> Sincerely,
> 
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --
> Web: http://alpinenow.com/
> 
> 
> On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming  wrote:
> Hi,
> 
> Is the code available for Hadoop to calculate the Logistic Regression 
> hyperplane?
> 
> I’m looking at the Examples:
> http://spark.apache.org/examples.html,
> 
> where there is the 110s vs 0.9s in Hadoop vs Spark comparison.
> 
> Thanks!
> 
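
For readers of the archive: a minimal sketch of calling the MLlib binary logistic regression that DB links above, using the 0.9-era API. This is not the code behind the published 110s/0.9s numbers; the master, input path, and iteration count below are placeholders.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "LogisticRegressionSketch")
    // Each input line is assumed to look like "label, f1 f2 f3 ..."
    val points = MLUtils.loadLabeledData(sc, "data/lr-data.txt").cache()
    val model = LogisticRegressionWithSGD.train(points, 100)
    // The learned weights and intercept define the separating hyperplane
    println(model.weights.mkString(" ") + " intercept=" + model.intercept)
    sc.stop()
  }
}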



Re: Hadoop LR comparison

2014-03-31 Thread DB Tsai
Hi Li-Ming,

This binary logistic regression using SGD is in
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala

We're working on multinomial logistic regression using Newton and L-BFGS
optimizers now; it will be released soon.


Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/


On Mon, Mar 31, 2014 at 11:38 PM, Tsai Li Ming wrote:

> Hi,
>
> Is the code available for Hadoop to calculate the Logistic Regression
> hyperplane?
>
> I’m looking at the Examples:
> http://spark.apache.org/examples.html,
>
> where there is the 110s vs 0.9s in Hadoop vs Spark comparison.
>
> Thanks!


Hadoop LR comparison

2014-03-31 Thread Tsai Li Ming
Hi,

Is the code available for Hadoop to calculate the Logistic Regression 
hyperplane?

I’m looking at the Examples:
http://spark.apache.org/examples.html,

where there is the 110s vs 0.9s in Hadoop vs Spark comparison.

Thanks!

advanced training or implementation assistance

2014-03-31 Thread Livni, Dana
I have some experience with writing applications using Spark.
I have also started to use YourKit to try to profile my app and detect 
performance issues.


But I'm not sure if I'm implementing best practices or how to perform 
advanced profiling of the code.

I'm looking for a reference to a company/individual that can provide advanced 
training or offer services for implementation assistance/code review/performance 
improvement.


Thanks, Dana.

-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Another problem I noticed is that the current 1.0.0 git tree still gives me the 
ClassNotFoundException. I see that SPARK-1052 is already fixed there. I 
then modified the pom.xml for mesos and protobuf, and that still gave the 
ClassNotFoundException. I also tried modifying pom.xml only for mesos, and that 
fails too. So I have no way of running the 1.0.0 git tree of Spark on mesos yet.

Thanks.

On 01-Apr-2014, at 3:28 am, deric  wrote:

> Which repository do you use?
> 
> The issue should be fixed in 0.9.1 and 1.0.0
> 
> https://spark-project.atlassian.net/browse/SPARK-1052
>   
> 
> There's an old repository 
> 
> https://github.com/apache/incubator-spark
> 
> and as Spark became a top-level project, it was moved to the new repo:
> 
> https://github.com/apache/spark
> 
> The 0.9.1 version hasn't been released yet, so you should get it from the
> new git repo.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-tp3510p3551.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Sonal Goyal
Hi Andy,

I would be interested in setting up a meetup in Delhi/NCR, India. Can you
please let me know how to go about organizing it?

Best Regards,
Sonal
Nube Technologies 






On Tue, Apr 1, 2014 at 10:04 AM, giive chen  wrote:

> Hi Andy
>
> We are from Taiwan. We are already planning to have a Spark meetup.
> We already have some resources like place and food budget. But we do need
> some other resource.
> Please contact me offline.
>
> Thanks
>
> Wisely Chen
>
>
> On Tue, Apr 1, 2014 at 1:28 AM, Andy Konwinski wrote:
>
>> Hi folks,
>>
>> We have seen a lot of community growth outside of the Bay Area and we are
>> looking to help spur even more!
>>
>> For starters, the organizers of the Spark meetups here in the Bay Area
>> want to help anybody that is interested in setting up a meetup in a new
>> city.
>>
>> Some amazing Spark champions have stepped forward in Seattle, Vancouver,
>> Boulder/Denver, and a few other areas already.
>>
>> Right now, we are looking to connect with you Spark enthusiasts in NYC
>> about helping to run an inaugural Spark Meetup in your area.
>>
>> You can reply to me directly if you are interested and I can tell you
>> about all of the resources we have to offer (speakers from the core
>> community, a budget for food, help scheduling, etc.), and let's make this
>> happen!
>>
>> Andy
>>
>
>


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread giive chen
Hi Andy

We are from Taiwan. We are already planning to have a Spark meetup.
We already have some resources like place and food budget. But we do need
some other resource.
Please contact me offline.

Thanks

Wisely Chen


On Tue, Apr 1, 2014 at 1:28 AM, Andy Konwinski wrote:

> Hi folks,
>
> We have seen a lot of community growth outside of the Bay Area and we are
> looking to help spur even more!
>
> For starters, the organizers of the Spark meetups here in the Bay Area
> want to help anybody that is interested in setting up a meetup in a new
> city.
>
> Some amazing Spark champions have stepped forward in Seattle, Vancouver,
> Boulder/Denver, and a few other areas already.
>
> Right now, we are looking to connect with you Spark enthusiasts in NYC
> about helping to run an inaugural Spark Meetup in your area.
>
> You can reply to me directly if you are interested and I can tell you
> about all of the resources we have to offer (speakers from the core
> community, a budget for food, help scheduling, etc.), and let's make this
> happen!
>
> Andy
>


Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I was referring to the protobuf version issue as the one that is not fixed. I could not 
find any reference to the problem or the fix.

Regarding SPARK-1052, I could pull the fix into my 0.9.0 tree (from the tarball 
on the website), and I see the fix in the latest git.

Thanks

On 01-Apr-2014, at 3:28 am, deric  wrote:

> Which repository do you use?
> 
> The issue should be fixed in 0.9.1 and 1.0.0
> 
> https://spark-project.atlassian.net/browse/SPARK-1052
>   
> 
> There's an old repository 
> 
> https://github.com/apache/incubator-spark
> 
> and as Spark became a top-level project, it was moved to the new repo:
> 
> https://github.com/apache/spark
> 
> The 0.9.1 version hasn't been released yet, so you should get it from the
> new git repo.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-tp3510p3551.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: network wordcount example

2014-03-31 Thread Chris Fregly
@eric-

i saw this exact issue recently while working on the KinesisWordCount.

are you passing "local[2]" to your example as the MASTER arg versus just
"local" or "local[1]"?

you need at least 2.  it's documented as "n>1" in the scala source docs -
which is easy to mistake for n>=1.

i just ran the NetworkWordCount sample and confirmed that local[1] does not
work, but  local[2] does work.

give that a whirl.

-chris




On Mon, Mar 31, 2014 at 10:41 AM, Diana Carroll wrote:

> Not sure what data you are sending in.  You could try calling
> "lines.print()" instead which should just output everything that comes in
> on the stream.  Just to test that your socket is receiving what you think
> you are sending.
>
>
> On Mon, Mar 31, 2014 at 12:18 PM, eric perler wrote:
>
>> Hello
>>
>> i just started working with spark today... and i am trying to run the
>> wordcount network example
>>
>> i created a socket server and client.. and i am sending data to the
>> server in an infinite loop
>>
>> when i run the spark class.. i see this output in the console...
>>
>> ---
>> Time: 1396281891000 ms
>> ---
>>
>> 14/03/31 11:04:51 INFO SparkContext: Job finished: take at
>> DStream.scala:586, took 0.056794606 s
>> 14/03/31 11:04:51 INFO JobScheduler: Finished job streaming job
>> 1396281891000 ms.0 from job set of time 1396281891000 ms
>> 14/03/31 11:04:51 INFO JobScheduler: Total delay: 0.101 s for time
>> 1396281891000 ms (execution: 0.058 s)
>> 14/03/31 11:04:51 INFO TaskSchedulerImpl: Remove TaskSet 3.0 from pool
>>
>> but i don't see any output from the wordcount operation when i make this
>> call...
>>
>> wordCounts.print();
>>
>> any help is greatly appreciated
>>
>> thanks in advance
>>
>
>
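
For completeness, a runnable sketch of the streaming word count under discussion, written against the 0.9-era API. The key detail from Chris's reply is the "local[2]" master: the socket receiver occupies one thread, so at least one more is needed to actually process batches. The host and port are placeholders for the test socket server.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // "local[2]" (or more): one core for the receiver, the rest for processing
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}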


Calling Spark enthusiasts in Austin, TX

2014-03-31 Thread Ognen Duzlevski
In the spirit of everything being bigger and better in TX ;) => if 
anyone is in Austin and interested in meeting up over Spark - contact 
me! There seems to be a Spark meetup group in Austin that has never met 
and my initial email to organize the first gathering was never acknowledged.

Ognen

On 3/31/14, 2:01 PM, Nick Pentreath wrote:
I would offer to host one in Cape Town but we're almost certainly the 
only Spark users in the country apart from perhaps one in Johannesburg :)

—
Sent from Mailbox  for iPhone


On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas 
<nicholas.cham...@gmail.com> wrote:


My fellow Bostonians and New Englanders,

We cannot allow New York to beat us to having a banging Spark meetup.

Respond to me (and I guess also Andy?) if you are interested.

Yana,

I'm not sure either what is involved in organizing, but we can
figure it out. I didn't know about the meetup that never took off.

Nick


On Mon, Mar 31, 2014 at 2:31 PM, Yana Kadiyska <[hidden email]> wrote:

Nicholas, I'm in Boston and would be interested in a Spark
group. Not
sure if you know this -- there was a meetup that never got off the
ground. Anyway, I'd be +1 for attending. Not sure what is
involved in
organizing. Seems a shame that a city like Boston doesn't have
one.

On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
<[hidden email]> wrote:
> As in, I am interested in helping organize a Spark meetup in
the Boston
> area.
>
>
> On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
> <[hidden email]> wrote:
>>
>> Well, since this thread has played out as it has, lemme
throw in a
>> shout-out for Boston.



View this message in context: Calling Spahk enthusiasts in Boston
Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: batching the output

2014-03-31 Thread Patrick Wendell
Ya this is a good way to do it.


On Sun, Mar 30, 2014 at 10:11 PM, Vipul Pandey  wrote:

> Hi,
>
> I need to batch the values in my final RDD before writing out to hdfs. The
> idea is to batch multiple "rows" in a protobuf and write those batches out
> - mostly to save some space as a lot of metadata is the same.
> e.g. 1,2,3,4,5,6 just batch them (1,2), (3,4),(5,6) and save three records
> instead of 6
>
> What I'm doing is using mapPartitions with the grouped
> function of the iterator by giving it a groupSize.
>
> val protoRDD:RDD[MyProto] =
> rdd.mapPartitions[Profiles](_.grouped(groupSize).map(seq =>{
> val profiles = MyProto(...)
> seq.foreach(x =>{
>   val row = new Row(x._1.toString)
>   row.setFloatValue(x._2)
>   profiles.addRow(row)
> })
> profiles
>   })
> )
> I haven't been able to test it out because of a separate issue (protobuf
> version mismatch - in a different thread)  - but i'm hoping it will work.
>
> Is there a better/straight-forward way of doing this?
>
> Thanks
> Vipul
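
For readers of the archive, a self-contained sketch of the mapPartitions + grouped batching pattern Patrick confirms above. MyProto/Row from Vipul's snippet stand in for his protobuf container, so this version simply emits plain lists; the group size and sample data are placeholders.

import org.apache.spark.SparkContext

object BatchingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "BatchingSketch")
    // one partition here so the batches come out exactly as in the example;
    // with real data the batches are formed within each partition
    val rdd = sc.parallelize(1 to 6, 1)

    val groupSize = 2
    // grouped(n) walks each partition's iterator in fixed-size chunks,
    // so we emit one record per batch instead of one per element
    val batched = rdd.mapPartitions(_.grouped(groupSize).map(_.toList))

    // prints List(1, 2), List(3, 4), List(5, 6) -- three records instead of six
    batched.collect().foreach(println)
    sc.stop()
  }
}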


Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-03-31 Thread Patrick Wendell
Spark now shades its own protobuf dependency so protobuf 2.4.1 shouldn't be
getting pulled in unless you are directly using akka yourself. Are you?

Does your project have other dependencies that might be indirectly pulling
in protobuf 2.4.1? It would be helpful if you could list all of your
dependencies including the exact Spark version and other libraries.

- Patrick


On Sun, Mar 30, 2014 at 10:03 PM, Vipul Pandey  wrote:

> I'm using ScalaBuff (which depends on protobuf2.5) and facing the same
> issue. any word on this one?
> On Mar 27, 2014, at 6:41 PM, Kanwaldeep  wrote:
>
> > We are using Protocol Buffer 2.5 to send messages to Spark Streaming 0.9
> with
> > Kafka stream setup. I have protocol Buffer 2.5 part of the uber jar
> deployed
> > on each of the spark worker nodes.
> > The message is compiled using 2.5 but then on runtime it is being
> > de-serialized by 2.4.1 as I'm getting the following exception
> >
> > java.lang.VerifyError (java.lang.VerifyError: class
> > com.snc.sinet.messages.XServerMessage$XServer overrides final method
> > getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;)
> > java.lang.ClassLoader.defineClass1(Native Method)
> > java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
> > java.lang.ClassLoader.defineClass(ClassLoader.java:615)
> > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
> >
> > Any suggestions on how I could still use ProtoBuf 2.5? Based on
> > https://spark-project.atlassian.net/browse/SPARK-995 we should be able
> > to use a different version of protobuf in the application.
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
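
As a follow-up to Patrick's question about indirect dependencies, a hypothetical build.sbt sketch of how such a conflict is often resolved, assuming some other dependency (represented by "some-library" below) is the one transitively pulling in protobuf 2.4.1. The artifact names, versions, and the exclude/pin approach are illustrative assumptions, not a confirmed fix for this thread.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "0.9.0-incubating",
  // "some-library" is a stand-in for whatever dependency drags in protobuf 2.4.1
  "org.example" % "some-library" % "1.0" exclude("com.google.protobuf", "protobuf-java"),
  // pin the protobuf version the messages were compiled against
  "com.google.protobuf" % "protobuf-java" % "2.5.0"
)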


Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread deric
Which repository do you use?

The issue should be fixed in 0.9.1 and 1.0.0

https://spark-project.atlassian.net/browse/SPARK-1052
  

There's an old repository 

https://github.com/apache/incubator-spark

and as Spark became a top-level project, it was moved to the new repo:

https://github.com/apache/spark

The 0.9.1 version hasn't been released yet, so you should get it from the
new git repo.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-tp3510p3551.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
Your suggestion took me past the ClassNotFoundException. I then hit an 
akka.actor.ActorNotFound exception. I patched PR 568 into my 0.9.0 Spark 
codebase and everything worked.

So thanks a lot, Tim. Is there a JIRA/PR for the protobuf issue? Why is it not 
fixed in the latest git tree?

Thanks.

On 31-Mar-2014, at 11:30 pm, Tim St Clair  wrote:

> It sounds like the protobuf issue. 
> 
> So FWIW, you might want to try updating the 0.9.0 w/ pom mods for mesos & 
> protobuf. 
> 
> mesos 0.17.0 & protobuf 2.5   
> 
> Cheers,
> Tim
> 
> - Original Message -
>> From: "Bharath Bhushan" 
>> To: user@spark.apache.org
>> Sent: Monday, March 31, 2014 9:46:32 AM
>> Subject: Re: java.lang.ClassNotFoundException - spark on mesos
>> 
>> I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and
>> the latest git tree.
>> 
>> Thanks
>> 
>> 
>> On 31-Mar-2014, at 7:24 pm, Tim St Clair  wrote:
>> 
>>> What versions are you running?
>>> 
>>> There is a known protobuf 2.5 mismatch, depending on your versions.
>>> 
>>> Cheers,
>>> Tim
>>> 
>>> - Original Message -
 From: "Bharath Bhushan" 
 To: user@spark.apache.org
 Sent: Monday, March 31, 2014 8:16:19 AM
 Subject: java.lang.ClassNotFoundException - spark on mesos
 
 I am facing different kinds of java.lang.ClassNotFoundException when
 trying
 to run spark on mesos. One error has to do with
 org.apache.spark.executor.MesosExecutorBackend. Another has to do with
 org.apache.spark.serializer.JavaSerializer. I see other people complaining
 about similar issues.
 
 I tried with different versions of the spark distribution - 0.9.0 and
 1.0.0-SNAPSHOT - and faced the same problem. I think the reason for this is
 related to the error below.
 
 $ jar -xf spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
 java.io.IOException: META-INF/license : could not create directory
   at sun.tools.jar.Main.extractFile(Main.java:907)
   at sun.tools.jar.Main.extract(Main.java:850)
   at sun.tools.jar.Main.run(Main.java:240)
   at sun.tools.jar.Main.main(Main.java:1147)
 
 This error happens with all the jars that I created. But the classes that
 have already been extracted differ between the cases. If JavaSerializer is
 not already extracted before encountering META-INF/license, then that class
 is not found during execution. If MesosExecutorBackend is not found, then
 that class shows up in the mesos slave error logs. Can someone confirm if
 this is a valid cause for the problem I am seeing? Any way I can debug this
 further?
 
 — Bharath
>>> 
>>> --
>>> Cheers,
>>> Tim
>>> Freedom, Features, Friends, First -> Fedora
>>> https://fedoraproject.org/wiki/SIGs/bigdata
>> 
>> 
> 
> -- 
> Cheers,
> Tim
> Freedom, Features, Friends, First -> Fedora
> https://fedoraproject.org/wiki/SIGs/bigdata



Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Denny Lee
If you have any questions on helping to get a Spark Meetup off the ground, 
please do not hesitate to ping me (denny.g@gmail.com).  I helped jump start 
the one here in Seattle (and tangentially have been helping the Vancouver and 
Denver ones as well).  HTH!


On March 31, 2014 at 12:35:38 PM, Patrick Grinaway (pgrina...@gmail.com) wrote:

Also in NYC, definitely interested in a spark meetup!

Sent from my iPhone

On Mar 31, 2014, at 3:07 PM, Jeremy Freeman  wrote:

Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, 
but am back in NYC quite often, and have been turning several computational 
people at Columbia / NYU / Simons Foundation onto Spark; there'd definitely be 
interest in those communities.

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Mar 31, 2014, at 2:31 PM, Yana Kadiyska  wrote:

Nicholas, I'm in Boston and would be interested in a Spark group. Not
sure if you know this -- there was a meetup that never got off the
ground. Anyway, I'd be +1 for attending. Not sure what is involved in
organizing. Seems a shame that a city like Boston doesn't have one.

On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
 wrote:
As in, I am interested in helping organize a Spark meetup in the Boston
area.


On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
 wrote:

Well, since this thread has played out as it has, lemme throw in a
shout-out for Boston.


On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:

We'd love to see a Spark user group in Los Angeles and connect with
others working with it here.

Ping me if you're in the LA area and use Spark at your company (
ch...@retentionscience.com ).

Chris

Retention Science
call: 734.272.3099
visit: Site | like: Facebook | follow: Twitter

On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
wrote:

How about Chicago?


On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:

Montreal or Toronto?


On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson 
wrote:

How about London?


--
Martin Goodson  |  VP Data Science
(0)20 3397 1240



On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski
 wrote:

Hi folks,

We have seen a lot of community growth outside of the Bay Area and we
are looking to help spur even more!

For starters, the organizers of the Spark meetups here in the Bay Area
want to help anybody that is interested in setting up a meetup in a new
city.

Some amazing Spark champions have stepped forward in Seattle,
Vancouver, Boulder/Denver, and a few other areas already.

Right now, we are looking to connect with you Spark enthusiasts in NYC
about helping to run an inaugural Spark Meetup in your area.

You can reply to me directly if you are interested and I can tell you
about all of the resources we have to offer (speakers from the core
community, a budget for food, help scheduling, etc.), and let's make this
happen!

Andy










Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Patrick Grinaway
Also in NYC, definitely interested in a spark meetup!

Sent from my iPhone

> On Mar 31, 2014, at 3:07 PM, Jeremy Freeman  wrote:
> 
> Happy to help with an NYC meet up (just emailed Andy). I recently moved to 
> VA, but am back in NYC quite often, and have been turning several 
> computational people at Columbia / NYU / Simons Foundation onto Spark; 
> there'd definitely be interest in those communities.
> 
> -- Jeremy
> 
> -
> jeremy freeman, phd
> neuroscientist
> @thefreemanlab
> 
>> On Mar 31, 2014, at 2:31 PM, Yana Kadiyska  wrote:
>> 
>> Nicholas, I'm in Boston and would be interested in a Spark group. Not
>> sure if you know this -- there was a meetup that never got off the
>> ground. Anyway, I'd be +1 for attending. Not sure what is involved in
>> organizing. Seems a shame that a city like Boston doesn't have one.
>> 
>> On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
>>  wrote:
>>> As in, I am interested in helping organize a Spark meetup in the Boston
>>> area.
>>> 
>>> 
>>> On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
>>>  wrote:
 
 Well, since this thread has played out as it has, lemme throw in a
 shout-out for Boston.
 
 
> On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:
> 
> We'd love to see a Spark user group in Los Angeles and connect with
> others working with it here.
> 
> Ping me if you're in the LA area and use Spark at your company (
> ch...@retentionscience.com ).
> 
> Chris
> 
> Retention Science
> call: 734.272.3099
> visit: Site | like: Facebook | follow: Twitter
> 
> On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
> wrote:
> 
> How about Chicago?
> 
> 
>> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
>> 
>> Montreal or Toronto?
>> 
>> 
>> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson 
>> wrote:
>>> 
>>> How about London?
>>> 
>>> 
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> 
>>> 
>>> 
>>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski
>>>  wrote:
 
 Hi folks,
 
 We have seen a lot of community growth outside of the Bay Area and we
 are looking to help spur even more!
 
 For starters, the organizers of the Spark meetups here in the Bay Area
 want to help anybody that is interested in setting up a meetup in a new
 city.
 
 Some amazing Spark champions have stepped forward in Seattle,
 Vancouver, Boulder/Denver, and a few other areas already.
 
 Right now, we are looking to connect with you Spark enthusiasts in NYC
 about helping to run an inaugural Spark Meetup in your area.
 
 You can reply to me directly if you are interested and I can tell you
 about all of the resources we have to offer (speakers from the core
 community, a budget for food, help scheduling, etc.), and let's make 
 this
 happen!
 
 Andy
> 


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Jeremy Freeman
Happy to help with an NYC meet up (just emailed Andy). I recently moved to VA, 
but am back in NYC quite often, and have been turning several computational 
people at Columbia / NYU / Simons Foundation onto Spark; there'd definitely be 
interest in those communities.

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Mar 31, 2014, at 2:31 PM, Yana Kadiyska  wrote:

> Nicholas, I'm in Boston and would be interested in a Spark group. Not
> sure if you know this -- there was a meetup that never got off the
> ground. Anyway, I'd be +1 for attending. Not sure what is involved in
> organizing. Seems a shame that a city like Boston doesn't have one.
> 
> On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
>  wrote:
>> As in, I am interested in helping organize a Spark meetup in the Boston
>> area.
>> 
>> 
>> On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
>>  wrote:
>>> 
>>> Well, since this thread has played out as it has, lemme throw in a
>>> shout-out for Boston.
>>> 
>>> 
>>> On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:
 
 We'd love to see a Spark user group in Los Angeles and connect with
 others working with it here.
 
 Ping me if you're in the LA area and use Spark at your company (
 ch...@retentionscience.com ).
 
 Chris
 
 Retention Science
 call: 734.272.3099
 visit: Site | like: Facebook | follow: Twitter
 
 On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
 wrote:
 
 How about Chicago?
 
 
 On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
> 
> Montreal or Toronto?
> 
> 
> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson 
> wrote:
>> 
>> How about London?
>> 
>> 
>> --
>> Martin Goodson  |  VP Data Science
>> (0)20 3397 1240
>> 
>> 
>> 
>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski
>>  wrote:
>>> 
>>> Hi folks,
>>> 
>>> We have seen a lot of community growth outside of the Bay Area and we
>>> are looking to help spur even more!
>>> 
>>> For starters, the organizers of the Spark meetups here in the Bay Area
>>> want to help anybody that is interested in setting up a meetup in a new
>>> city.
>>> 
>>> Some amazing Spark champions have stepped forward in Seattle,
>>> Vancouver, Boulder/Denver, and a few other areas already.
>>> 
>>> Right now, we are looking to connect with you Spark enthusiasts in NYC
>>> about helping to run an inaugural Spark Meetup in your area.
>>> 
>>> You can reply to me directly if you are interested and I can tell you
>>> about all of the resources we have to offer (speakers from the core
>>> community, a budget for food, help scheduling, etc.), and let's make 
>>> this
>>> happen!
>>> 
>>> Andy
>> 
>> 
> 
 
 
>>> 
>> 



Re: Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nick Pentreath
I would offer to host one in Cape Town but we're almost certainly the only 
Spark users in the country apart from perhaps one in Johannesburg :)
—
Sent from Mailbox for iPhone

On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas
 wrote:

> My fellow Bostonians and New Englanders,
> We cannot allow New York to beat us to having a banging Spark meetup.
> Respond to me (and I guess also Andy?) if you are interested.
> Yana,
> I'm not sure either what is involved in organizing, but we can figure it
> out. I didn't know about the meetup that never took off.
> Nick
> On Mon, Mar 31, 2014 at 2:31 PM, Yana Kadiyska wrote:
>> Nicholas, I'm in Boston and would be interested in a Spark group. Not
>> sure if you know this -- there was a meetup that never got off the
>> ground. Anyway, I'd be +1 for attending. Not sure what is involved in
>> organizing. Seems a shame that a city like Boston doesn't have one.
>>
>> On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
>>  wrote:
>> > As in, I am interested in helping organize a Spark meetup in the Boston
>> > area.
>> >
>> >
>> > On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
>> >  wrote:
>> >>
>> >> Well, since this thread has played out as it has, lemme throw in a
>> >> shout-out for Boston.
>>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Spahk-enthusiasts-in-Boston-tp3544.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Calling Spahk enthusiasts in Boston

2014-03-31 Thread Nicholas Chammas
My fellow Bostonians and New Englanders,

We cannot allow New York to beat us to having a banging Spark meetup.

Respond to me (and I guess also Andy?) if you are interested.

Yana,

I'm not sure either what is involved in organizing, but we can figure it
out. I didn't know about the meetup that never took off.

Nick


On Mon, Mar 31, 2014 at 2:31 PM, Yana Kadiyska wrote:

> Nicholas, I'm in Boston and would be interested in a Spark group. Not
> sure if you know this -- there was a meetup that never got off the
> ground. Anyway, I'd be +1 for attending. Not sure what is involved in
> organizing. Seems a shame that a city like Boston doesn't have one.
>
> On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
>  wrote:
> > As in, I am interested in helping organize a Spark meetup in the Boston
> > area.
> >
> >
> > On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
> >  wrote:
> >>
> >> Well, since this thread has played out as it has, lemme throw in a
> >> shout-out for Boston.
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Spahk-enthusiasts-in-Boston-tp3544.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Yana Kadiyska
Nicholas, I'm in Boston and would be interested in a Spark group. Not
sure if you know this -- there was a meetup that never got off the
ground. Anyway, I'd be +1 for attending. Not sure what is involved in
organizing. Seems a shame that a city like Boston doesn't have one.

On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas
 wrote:
> As in, I am interested in helping organize a Spark meetup in the Boston
> area.
>
>
> On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas
>  wrote:
>>
>> Well, since this thread has played out as it has, lemme throw in a
>> shout-out for Boston.
>>
>>
>> On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:
>>>
>>> We'd love to see a Spark user group in Los Angeles and connect with
>>> others working with it here.
>>>
>>> Ping me if you're in the LA area and use Spark at your company (
>>> ch...@retentionscience.com ).
>>>
>>> Chris
>>>
>>> Retention Science
>>> call: 734.272.3099
>>> visit: Site | like: Facebook | follow: Twitter
>>>
>>> On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
>>> wrote:
>>>
>>> How about Chicago?
>>>
>>>
>>> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:

 Montreal or Toronto?


 On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson 
 wrote:
>
> How about London?
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
> 
>
>
> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski
>  wrote:
>>
>> Hi folks,
>>
>> We have seen a lot of community growth outside of the Bay Area and we
>> are looking to help spur even more!
>>
>> For starters, the organizers of the Spark meetups here in the Bay Area
>> want to help anybody that is interested in setting up a meetup in a new
>> city.
>>
>> Some amazing Spark champions have stepped forward in Seattle,
>> Vancouver, Boulder/Denver, and a few other areas already.
>>
>> Right now, we are looking to connect with you Spark enthusiasts in NYC
>> about helping to run an inaugural Spark Meetup in your area.
>>
>> You can reply to me directly if you are interested and I can tell you
>> about all of the resources we have to offer (speakers from the core
>> community, a budget for food, help scheduling, etc.), and let's make this
>> happen!
>>
>> Andy
>
>

>>>
>>>
>>
>


Re: how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Thanks
-Mo


2014-03-31 13:16 GMT-05:00 Evgeny Shishkin :

>
> On 31 Mar 2014, at 21:05, Dong Mo  wrote:
>
> > Dear list,
> >
> > I was wondering how Spark handles congestion when the upstream is
> generating dstreams faster than downstream workers can handle?
>
> It will eventually OOM.
>
>


Re: how spark dstream handles congestion?

2014-03-31 Thread Evgeny Shishkin

On 31 Mar 2014, at 21:05, Dong Mo  wrote:

> Dear list,
> 
> I was wondering how Spark handles congestion when the upstream is generating 
> dstreams faster than downstream workers can handle?

It will eventually OOM.



how spark dstream handles congestion?

2014-03-31 Thread Dong Mo
Dear list,

I was wondering how Spark handles congestion when the upstream is
generating dstreams faster than downstream workers can handle?

Thanks
-Mo


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread François Le Lay
Hi Andy,

NYC speaking! Pretty sure we can come up with something here.
Let's discuss offline!

François




On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:

> We'd love to see a Spark user group in Los Angeles and connect with others
> working with it here.
>
> Ping me if you're in the LA area and use Spark at your company (
> ch...@retentionscience.com ).
>
> Chris
>
> *Retention** Science*
> call: 734.272.3099
> visit: Site | like: Facebook | follow: Twitter
>
> On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
> wrote:
>
> How about Chicago?
>
>
> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
>
>> Montreal or Toronto?
>>
>>
>> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:
>>
>>> How about London?
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> 
>>>
>>>
>>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski >> > wrote:
>>>
 Hi folks,

 We have seen a lot of community growth outside of the Bay Area and we
 are looking to help spur even more!

 For starters, the organizers of the Spark meetups here in the Bay Area
 want to help anybody that is interested in setting up a meetup in a new
 city.

 Some amazing Spark champions have stepped forward in Seattle,
 Vancouver, Boulder/Denver, and a few other areas already.

 Right now, we are looking to connect with you Spark enthusiasts in NYC
 about helping to run an inaugural Spark Meetup in your area.

 You can reply to me directly if you are interested and I can tell you
 about all of the resources we have to offer (speakers from the core
 community, a budget for food, help scheduling, etc.), and let's make this
 happen!

 Andy

>>>
>>>
>>
>
>


-- 
François /fly Le Lay
Data Infra Chapter Lead NYC
+1 (646)-656-0075


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nicholas Chammas
As in, I am interested in helping organize a Spark meetup in the Boston
area.


On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Well, since this thread has played out as it has, lemme throw in a
> shout-out for Boston.
>
>
> On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:
>
>> We'd love to see a Spark user group in Los Angeles and connect with
>> others working with it here.
>>
>> Ping me if you're in the LA area and use Spark at your company (
>> ch...@retentionscience.com ).
>>
>> Chris
>>
>> *Retention** Science*
>> call: 734.272.3099
>> visit: Site | like: Facebook | follow: Twitter
>>
>> On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
>> wrote:
>>
>> How about Chicago?
>>
>>
>> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
>>
>>> Montreal or Toronto?
>>>
>>>
>>> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:
>>>
 How about London?


 --
 Martin Goodson  |  VP Data Science
 (0)20 3397 1240
 


 On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski <
 andykonwin...@gmail.com> wrote:

> Hi folks,
>
> We have seen a lot of community growth outside of the Bay Area and we
> are looking to help spur even more!
>
> For starters, the organizers of the Spark meetups here in the Bay Area
> want to help anybody that is interested in setting up a meetup in a new
> city.
>
> Some amazing Spark champions have stepped forward in Seattle,
> Vancouver, Boulder/Denver, and a few other areas already.
>
> Right now, we are looking to connect with you Spark enthusiasts in NYC
> about helping to run an inaugural Spark Meetup in your area.
>
> You can reply to me directly if you are interested and I can tell you
> about all of the resources we have to offer (speakers from the core
> community, a budget for food, help scheduling, etc.), and let's make this
> happen!
>
> Andy
>


>>>
>>
>>
>


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nicholas Chammas
Well, since this thread has played out as it has, lemme throw in a
shout-out for Boston.


On Mon, Mar 31, 2014 at 1:52 PM, Chris Gore  wrote:

> We'd love to see a Spark user group in Los Angeles and connect with others
> working with it here.
>
> Ping me if you're in the LA area and use Spark at your company (
> ch...@retentionscience.com ).
>
> Chris
>
> *Retention** Science*
> call: 734.272.3099
> visit: Site | like: Facebook | follow: Twitter
>
> On Mar 31, 2014, at 10:42 AM, Anurag Dodeja 
> wrote:
>
> How about Chicago?
>
>
> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
>
>> Montreal or Toronto?
>>
>>
>> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:
>>
>>> How about London?
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> 
>>>
>>>
>>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski >> > wrote:
>>>
 Hi folks,

 We have seen a lot of community growth outside of the Bay Area and we
 are looking to help spur even more!

 For starters, the organizers of the Spark meetups here in the Bay Area
 want to help anybody that is interested in setting up a meetup in a new
 city.

 Some amazing Spark champions have stepped forward in Seattle,
 Vancouver, Boulder/Denver, and a few other areas already.

 Right now, we are looking to connect with you Spark enthusiasts in NYC
 about helping to run an inaugural Spark Meetup in your area.

 You can reply to me directly if you are interested and I can tell you
 about all of the resources we have to offer (speakers from the core
 community, a budget for food, help scheduling, etc.), and let's make this
 happen!

 Andy

>>>
>>>
>>
>
>


Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
It sounds like the protobuf issue. 

So FWIW, you might want to try updating the 0.9.0 w/ pom mods for mesos & 
protobuf. 

mesos 0.17.0 & protobuf 2.5   

Cheers,
Tim

- Original Message -
> From: "Bharath Bhushan" 
> To: user@spark.apache.org
> Sent: Monday, March 31, 2014 9:46:32 AM
> Subject: Re: java.lang.ClassNotFoundException - spark on mesos
> 
> I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and
> the latest git tree.
> 
> Thanks
> 
> 
> On 31-Mar-2014, at 7:24 pm, Tim St Clair  wrote:
> 
> > What versions are you running?
> > 
> > There is a known protobuf 2.5 mismatch, depending on your versions.
> > 
> > Cheers,
> > Tim
> > 
> > - Original Message -
> >> From: "Bharath Bhushan" 
> >> To: user@spark.apache.org
> >> Sent: Monday, March 31, 2014 8:16:19 AM
> >> Subject: java.lang.ClassNotFoundException - spark on mesos
> >> 
> >> I am facing different kinds of java.lang.ClassNotFoundException when
> >> trying
> >> to run spark on mesos. One error has to do with
> >> org.apache.spark.executor.MesosExecutorBackend. Another has to do with
> >> org.apache.spark.serializer.JavaSerializer. I see other people complaining
> >> about similar issues.
> >> 
> >> I tried with different versions of the spark distribution - 0.9.0 and
> >> 1.0.0-SNAPSHOT - and faced the same problem. I think the reason for this is
> >> related to the error below.
> >> 
> >> $ jar -xf spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
> >> java.io.IOException: META-INF/license : could not create directory
> >>at sun.tools.jar.Main.extractFile(Main.java:907)
> >>at sun.tools.jar.Main.extract(Main.java:850)
> >>at sun.tools.jar.Main.run(Main.java:240)
> >>at sun.tools.jar.Main.main(Main.java:1147)
> >> 
> >> This error happens with all the jars that I created. But the classes that
> >> have already been extracted differ between the cases. If JavaSerializer is
> >> not already extracted before encountering META-INF/license, then that class
> >> is not found during execution. If MesosExecutorBackend is not found, then
> >> that class shows up in the mesos slave error logs. Can someone confirm if
> >> this is a valid cause for the problem I am seeing? Any way I can debug
> >> this further?
> >> 
> >> — Bharath
> > 
> > --
> > Cheers,
> > Tim
> > Freedom, Features, Friends, First -> Fedora
> > https://fedoraproject.org/wiki/SIGs/bigdata
> 
> 

-- 
Cheers,
Tim
Freedom, Features, Friends, First -> Fedora
https://fedoraproject.org/wiki/SIGs/bigdata


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Chris Gore
We'd love to see a Spark user group in Los Angeles and connect with others 
working with it here.

Ping me if you're in the LA area and use Spark at your company ( 
ch...@retentionscience.com ).

Chris
 
Retention Science
call: 734.272.3099
visit: Site | like: Facebook | follow: Twitter

On Mar 31, 2014, at 10:42 AM, Anurag Dodeja  wrote:

> How about Chicago?
> 
> 
> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
> Montreal or Toronto?
> 
> 
> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson  wrote:
> How about London?
> 
> 
> -- 
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240  
> 
> 
> 
> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski  
> wrote:
> Hi folks,
> 
> We have seen a lot of community growth outside of the Bay Area and we are 
> looking to help spur even more!
> 
> For starters, the organizers of the Spark meetups here in the Bay Area want 
> to help anybody that is interested in setting up a meetup in a new city.
> 
> Some amazing Spark champions have stepped forward in Seattle, Vancouver, 
> Boulder/Denver, and a few other areas already.
> 
> Right now, we are looking to connect with you Spark enthusiasts in NYC about 
> helping to run an inaugural Spark Meetup in your area.
> 
> You can reply to me directly if you are interested and I can tell you about 
> all of the resources we have to offer (speakers from the core community, a 
> budget for food, help scheduling, etc.), and let's make this happen!
> 
> Andy
> 
> 
> 



Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Andy Konwinski
Responses about London, Montreal/Toronto, DC, Chicago. Great coverage so
far, and keep 'em coming! (still looking for an NYC connection)

I'll reply to each of you off-list to coordinate next-steps for setting up
a Spark meetup in your home area.

Thanks again, this is super exciting.

Andy


On Mon, Mar 31, 2014 at 10:42 AM, Anurag Dodeja wrote:

> How about Chicago?
>
>
> On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:
>
>> Montreal or Toronto?
>>
>>
>> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:
>>
>>> How about London?
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> [image: Inline image 1]
>>>
>>>
>>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski >> > wrote:
>>>
 Hi folks,

 We have seen a lot of community growth outside of the Bay Area and we
 are looking to help spur even more!

 For starters, the organizers of the Spark meetups here in the Bay Area
 want to help anybody that is interested in setting up a meetup in a new
 city.

 Some amazing Spark champions have stepped forward in Seattle,
 Vancouver, Boulder/Denver, and a few other areas already.

 Right now, we are looking to connect with you Spark enthusiasts in NYC
 about helping to run an inaugural Spark Meetup in your area.

 You can reply to me directly if you are interested and I can tell you
 about all of the resources we have to offer (speakers from the core
 community, a budget for food, help scheduling, etc.), and let's make this
 happen!

 Andy

>>>
>>>
>>
>

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Anurag Dodeja
How about Chicago?


On Mon, Mar 31, 2014 at 12:38 PM, Nan Zhu  wrote:

> Montreal or Toronto?
>
>
> On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:
>
>> How about London?
>>
>>
>> --
>> Martin Goodson  |  VP Data Science
>> (0)20 3397 1240
>> [image: Inline image 1]
>>
>>
>> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski 
>> wrote:
>>
>>> Hi folks,
>>>
>>> We have seen a lot of community growth outside of the Bay Area and we
>>> are looking to help spur even more!
>>>
>>> For starters, the organizers of the Spark meetups here in the Bay Area
>>> want to help anybody that is interested in setting up a meetup in a new
>>> city.
>>>
>>> Some amazing Spark champions have stepped forward in Seattle, Vancouver,
>>> Boulder/Denver, and a few other areas already.
>>>
>>> Right now, we are looking to connect with you Spark enthusiasts in NYC
>>> about helping to run an inaugural Spark Meetup in your area.
>>>
>>> You can reply to me directly if you are interested and I can tell you
>>> about all of the resources we have to offer (speakers from the core
>>> community, a budget for food, help scheduling, etc.), and let's make this
>>> happen!
>>>
>>> Andy
>>>
>>
>>
>

Re: network wordcount example

2014-03-31 Thread Diana Carroll
Not sure what data you are sending in.  You could try calling
"lines.print()" instead which should just output everything that comes in
on the stream.  Just to test that your socket is receiving what you think
you are sending.


On Mon, Mar 31, 2014 at 12:18 PM, eric perler wrote:

> Hello
>
> i just started working with spark today... and i am trying to run the
> wordcount network example
>
> i created a socket server and client.. and i am sending data to the server
> in an infinite loop
>
> when i run the spark class.. i see this output in the console...
>
> ---
> Time: 1396281891000 ms
> ---
>
> 14/03/31 11:04:51 INFO SparkContext: Job finished: take at
> DStream.scala:586, took 0.056794606 s
> 14/03/31 11:04:51 INFO JobScheduler: Finished job streaming job
> 1396281891000 ms.0 from job set of time 1396281891000 ms
> 14/03/31 11:04:51 INFO JobScheduler: Total delay: 0.101 s for time
> 1396281891000 ms (execution: 0.058 s)
> 14/03/31 11:04:51 INFO TaskSchedulerImpl: Remove TaskSet 3.0 from pool
>
> but i don't see any output from the wordcount operation when i make this
> call...
>
> wordCounts.print();
>
> any help is greatly appreciated
>
> thanks in advance
>


Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Nan Zhu
Montreal or Toronto?


On Mon, Mar 31, 2014 at 1:36 PM, Martin Goodson wrote:

> How about London?
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
> [image: Inline image 1]
>
>
> On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski 
> wrote:
>
>> Hi folks,
>>
>> We have seen a lot of community growth outside of the Bay Area and we are
>> looking to help spur even more!
>>
>> For starters, the organizers of the Spark meetups here in the Bay Area
>> want to help anybody that is interested in setting up a meetup in a new
>> city.
>>
>> Some amazing Spark champions have stepped forward in Seattle, Vancouver,
>> Boulder/Denver, and a few other areas already.
>>
>> Right now, we are looking to connect with you Spark enthusiasts in NYC
>> about helping to run an inaugural Spark Meetup in your area.
>>
>> You can reply to me directly if you are interested and I can tell you
>> about all of the resources we have to offer (speakers from the core
>> community, a budget for food, help scheduling, etc.), and let's make this
>> happen!
>>
>> Andy
>>
>
>

Re: Calling Spark enthusiasts in NYC

2014-03-31 Thread Martin Goodson
How about London?


-- 
Martin Goodson  |  VP Data Science
(0)20 3397 1240
[image: Inline image 1]


On Mon, Mar 31, 2014 at 6:28 PM, Andy Konwinski wrote:

> Hi folks,
>
> We have seen a lot of community growth outside of the Bay Area and we are
> looking to help spur even more!
>
> For starters, the organizers of the Spark meetups here in the Bay Area
> want to help anybody that is interested in setting up a meetup in a new
> city.
>
> Some amazing Spark champions have stepped forward in Seattle, Vancouver,
> Boulder/Denver, and a few other areas already.
>
> Right now, we are looking to connect with you Spark enthusiasts in NYC
> about helping to run an inaugural Spark Meetup in your area.
>
> You can reply to me directly if you are interested and I can tell you
> about all of the resources we have to offer (speakers from the core
> community, a budget for food, help scheduling, etc.), and let's make this
> happen!
>
> Andy
>

Calling Spark enthusiasts in NYC

2014-03-31 Thread Andy Konwinski
Hi folks,

We have seen a lot of community growth outside of the Bay Area and we are
looking to help spur even more!

For starters, the organizers of the Spark meetups here in the Bay Area want
to help anybody that is interested in setting up a meetup in a new city.

Some amazing Spark champions have stepped forward in Seattle, Vancouver,
Boulder/Denver, and a few other areas already.

Right now, we are looking to connect with you Spark enthusiasts in NYC
about helping to run an inaugural Spark Meetup in your area.

You can reply to me directly if you are interested and I can tell you about
all of the resources we have to offer (speakers from the core community, a
budget for food, help scheduling, etc.), and let's make this happen!

Andy


Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
OK sweet. Thanks for walking me through that.

I wish this were StackOverflow so I could bestow some nice rep on all you
helpful people.


On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson  wrote:

> Note that you may have minSplits set to more than the number of cores in
> the cluster, and Spark will just run as many as possible at a time. This is
> better if certain nodes may be slow, for instance.
>
> In general, it is not necessarily the case that doubling the number of
> cores doing IO will double the throughput, because you could be saturating
> the throughput with fewer cores. However, S3 is odd in that each connection
> gets way less bandwidth than your network link can provide, and it does
> seem to scale linearly with the number of connections. So, yes, taking
> minSplits up to 4 (or higher) will likely result in a 2x performance
> improvement.
>
> saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
> being called on. So for instance:
>
> sc.textFile(myInputFile, 15).map(lambda x: x +
> "!!!").saveAsTextFile(myOutputFile)
>
> will use 15 partitions to read the text file (i.e., up to 15 cores at a
> time) and then again to save back to S3.
>
>
>
> On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> So setting minSplits will
>> set the parallelism on the read in SparkContext.textFile(), assuming I have
>> the cores in the cluster to deliver that level of parallelism. And if I
>> don't explicitly provide it, Spark will set the minSplits to 2.
>>
>> So for example, say I have a cluster with 4 cores total, and it takes 40
>> minutes to read a single file from S3 with minSplits at 2. Tt should take
>> roughly 20 minutes to read the same file if I up minSplits to 4.
>>
>> Did I understand that correctly?
>>
>> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing
>> that's not an operation the user can tune.
>>
>>
>> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson wrote:
>>
>>> Spark will only use each core for one task at a time, so doing
>>>
>>> sc.textFile(<file>, <num reducers>)
>>>
>>> where you set "num reducers" to at least as many as the total number of
>>> cores in your cluster, is about as fast as you can get out of the box. Same
>>> goes for saveAsTextFile.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Howdy-doody,

 I have a single, very large file sitting in S3 that I want to read in
 with sc.textFile(). What are the best practices for reading in this file as
 quickly as possible? How do I parallelize the read as much as possible?

 Similarly, say I have a single, very large RDD sitting in memory that I
 want to write out to S3 with RDD.saveAsTextFile(). What are the best
 practices for writing this file out as quickly as possible?

 Nick


 --
 View this message in context: Best practices: Parallelized write to /
 read from S3
 Sent from the Apache Spark User List mailing list 
 archive at Nabble.com.

>>>
>>>
>>
>


Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
Note that you may have minSplits set to more than the number of cores in
the cluster, and Spark will just run as many as possible at a time. This is
better if certain nodes may be slow, for instance.

In general, it is not necessarily the case that doubling the number of
cores doing IO will double the throughput, because you could be saturating
the throughput with fewer cores. However, S3 is odd in that each connection
gets way less bandwidth than your network link can provide, and it does
seem to scale linearly with the number of connections. So, yes, taking
minSplits up to 4 (or higher) will likely result in a 2x performance
improvement.

saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
being called on. So for instance:

sc.textFile(myInputFile, 15).map(lambda x: x +
"!!!").saveAsTextFile(myOutputFile)

will use 15 partitions to read the text file (i.e., up to 15 cores at a
time) and then again to save back to S3.



On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> So setting minSplits will
> set the parallelism on the read in SparkContext.textFile(), assuming I have
> the cores in the cluster to deliver that level of parallelism. And if I
> don't explicitly provide it, Spark will set the minSplits to 2.
>
> So for example, say I have a cluster with 4 cores total, and it takes 40
> minutes to read a single file from S3 with minSplits at 2. It should take
> roughly 20 minutes to read the same file if I up minSplits to 4.
>
> Did I understand that correctly?
>
> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing
> that's not an operation the user can tune.
>
>
> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson wrote:
>
>> Spark will only use each core for one task at a time, so doing
>>
>> sc.textFile(<file>, <num reducers>)
>>
>> where you set "num reducers" to at least as many as the total number of
>> cores in your cluster, is about as fast as you can get out of the box. Same
>> goes for saveAsTextFile.
>>
>>
>> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Howdy-doody,
>>>
>>> I have a single, very large file sitting in S3 that I want to read in
>>> with sc.textFile(). What are the best practices for reading in this file as
>>> quickly as possible? How do I parallelize the read as much as possible?
>>>
>>> Similarly, say I have a single, very large RDD sitting in memory that I
>>> want to write out to S3 with RDD.saveAsTextFile(). What are the best
>>> practices for writing this file out as quickly as possible?
>>>
>>> Nick
>>>
>>>
>>> --
>>> View this message in context: Best practices: Parallelized write to /
>>> read from S3
>>> Sent from the Apache Spark User List mailing list 
>>> archive at Nabble.com.
>>>
>>
>>
>
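
A minimal sketch of the pattern from this thread in the Scala API: pass an explicit minSplits so the S3 read fans out over more cores, and note that saveAsTextFile runs one task (and writes one output part) per partition of the RDD being saved. The master, bucket names, and paths are placeholders, and S3 credentials are assumed to be configured.

import org.apache.spark.SparkContext

object S3ReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "S3ReadWriteSketch")

    // minSplits = 15 => up to 15 concurrent read tasks against S3
    val lines = sc.textFile("s3n://my-bucket/big-input.txt", 15)
    val shouted = lines.map(_ + "!!!")

    // one part-* file per partition of `shouted`
    shouted.saveAsTextFile("s3n://my-bucket/output/")
    sc.stop()
  }
}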


Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
> Thanks for the clarification. My question is about the error above "error:
> class $iwC needs to be abstract"
>

This is a fairly confusing scala REPL (interpreter) error.  Under the
covers, to run the line you entered into the interpreter, scala is creating
an object called $iwC with your code inserted into it.  So this error is
telling you that you cannot create a val in an object (or in a line of the
REPL) without giving it a value.

and what does the RDD bring, since I can do the DSL without the
"people: org.apache.spark.rdd.RDD[Person]"
>

This is just assigning the type for the variable people (which is really
only there for people reading the code in this particular example).  In the
case of the line from the first example where we leave it out, the scala
compiler will add it for us using type inference.
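
To make the distinction concrete, here is the contrast Michael describes, assuming the spark-shell (where sc is predefined) and the guide's first example; the file path and parsing are recalled from that example and should be treated as illustrative.

// The guide's placeholder, entered literally, is only a declaration and
// triggers the "class $iwC needs to be abstract" REPL error:
//   val people: org.apache.spark.rdd.RDD[Person]
// It has to be replaced by an actual definition, e.g.:
case class Person(name: String, age: Int)

val people: org.apache.spark.rdd.RDD[Person] =
  sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))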


Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
So setting minSplits will
set the parallelism on the read in SparkContext.textFile(), assuming I have
the cores in the cluster to deliver that level of parallelism. And if I
don't explicitly provide it, Spark will set the minSplits to 2.

So for example, say I have a cluster with 4 cores total, and it takes 40
minutes to read a single file from S3 with minSplits at 2. It should take
roughly 20 minutes to read the same file if I up minSplits to 4.

Did I understand that correctly?

RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing
that's not an operation the user can tune.


On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson  wrote:

> Spark will only use each core for one task at a time, so doing
>
> sc.textFile(<file>, <num reducers>)
>
> where you set "num reducers" to at least as many as the total number of
> cores in your cluster, is about as fast as you can get out of the box. Same
> goes for saveAsTextFile.
>
>
> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Howdy-doody,
>>
>> I have a single, very large file sitting in S3 that I want to read in
>> with sc.textFile(). What are the best practices for reading in this file as
>> quickly as possible? How do I parallelize the read as much as possible?
>>
>> Similarly, say I have a single, very large RDD sitting in memory that I
>> want to write out to S3 with RDD.saveAsTextFile(). What are the best
>> practices for writing this file out as quickly as possible?
>>
>> Nick
>>
>>
>> --
>> View this message in context: Best practices: Parallelized write to / read from S3
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>


Re: Best practices: Parallelized write to / read from S3

2014-03-31 Thread Aaron Davidson
Spark will only use each core for one task at a time, so doing

sc.textFile(<path>, <num reducers>)

where you set "num reducers" to at least as many as the total number of
cores in your cluster, is about as fast you can get out of the box. Same
goes for saveAsTextFile.
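
For concreteness, a rough sketch of both sides, under assumptions (a 16-core
cluster, placeholder s3n:// paths, and AWS credentials already configured for
Hadoop's S3 filesystem):

val numCores = 16  // total cores in the cluster (assumption)

// Read: ask for at least numCores splits so every core gets work.
val lines = sc.textFile("s3n://my-bucket/big-input.txt", numCores)

// Write: saveAsTextFile produces one part-XXXXX file per partition,
// written out in parallel across the cluster.
lines.map(_.toUpperCase).saveAsTextFile("s3n://my-bucket/output/")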


On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Howdy-doody,
>
> I have a single, very large file sitting in S3 that I want to read in with
> sc.textFile(). What are the best practices for reading in this file as
> quickly as possible? How do I parallelize the read as much as possible?
>
> Similarly, say I have a single, very large RDD sitting in memory that I
> want to write out to S3 with RDD.saveAsTextFile(). What are the best
> practices for writing this file out as quickly as possible?
>
> Nick
>
>
> --
> View this message in context: Best practices: Parallelized write to / read from S3
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Manoj Samel
Thanks, that works.

It wasn't clear whether the second parameter list is just for aggregate
specifications or for any expression.


On Mon, Mar 31, 2014 at 9:03 AM, Michael Armbrust wrote:

> This is similar to how SQL works, items in the GROUP BY clause are not
> included in the output by default.  You will need to include 'a in the
> second parameter list (which is similar to the SELECT clause) as well if
> you want it included in the output.
>
>
> On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel wrote:
>
>> Hi,
>>
>> If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
>> resulting RDD should have 'a, 'foo and 'bar.
>>
>> The result RDD just shows 'foo and 'bar and is missing 'a
>>
>> Thoughts?
>>
>> Thanks,
>>
>> Manoj
>>
>
>


Re: Error in SparkSQL Example

2014-03-31 Thread Manoj Samel
Hi Michael,

Thanks for the clarification. My question is about the error above ("error:
class $iwC needs to be abstract") and what the RDD declaration brings, since I can
do the DSL without the "people: org.apache.spark.rdd.RDD[Person]" line.

Thanks,


On Mon, Mar 31, 2014 at 9:13 AM, Michael Armbrust wrote:

> "val people: RDD[Person] // An RDD of case class objects, from the first
> example." is just a placeholder to avoid cluttering up each example with
> the same code for creating an RDD.  The ": RDD[Person]" is just there to
> let you know the expected type of the variable 'people'.  Perhaps there is
> a clearer way to indicate this.
>
> As you have realized, using the full line from the first example will
> allow you to run the rest of them.
>
>
>
> On Sun, Mar 30, 2014 at 7:31 AM, Manoj Samel wrote:
>
>> Hi,
>>
>> On
>> http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
>> I am trying to run code on "Writing Language-Integrated Relational Queries"
>> ( I have 1.0.0 Snapshot ).
>>
>> I am running into error on
>>
>> val people: RDD[Person] // An RDD of case class objects, from the first
>> example.
>>
>> scala> val people: RDD[Person]
>> <console>:19: error: not found: type RDD
>>val people: RDD[Person]
>>^
>>
>> scala> val people: org.apache.spark.rdd.RDD[Person]
>> <console>:18: error: class $iwC needs to be abstract, since value people
>> is not defined
>> class $iwC extends Serializable {
>>   ^
>>
>> Any idea what the issue is ?
>>
>> Also, its not clear what does the RDD[Person] brings. I can run the DSL
>> without the case class objects RDD ...
>>
>> val people =
>> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
>> => Person(p(0), p(1).trim.toInt))
>>
>> val teenagers = people.where('age >= 13).where('age <= 19)
>>
>> Thanks,
>>
>>
>>
>>
>


network wordcount example

2014-03-31 Thread eric perler
Hello
I just started working with Spark today, and I am trying to run the network
wordcount example. I created a socket server and client, and I am sending data
to the server in an infinite loop. When I run the Spark class, I see this output
in the console:

---
Time: 1396281891000 ms
---
14/03/31 11:04:51 INFO SparkContext: Job finished: take at DStream.scala:586, took 0.056794606 s
14/03/31 11:04:51 INFO JobScheduler: Finished job streaming job 1396281891000 ms.0 from job set of time 1396281891000 ms
14/03/31 11:04:51 INFO JobScheduler: Total delay: 0.101 s for time 1396281891000 ms (execution: 0.058 s)
14/03/31 11:04:51 INFO TaskSchedulerImpl: Remove TaskSet 3.0 from pool

but I don't see any output from the wordcount operation when I make this call:

wordCounts.print();

Any help is greatly appreciated. Thanks in advance.
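
For reference, a minimal self-contained version of the network wordcount job
looks roughly like the sketch below (the master, host, port, and batch interval
are assumptions; note that a local run needs at least two threads so the socket
receiver does not occupy the only execution slot):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // implicits for pair DStream operations

object NetworkWordCountSketch {
  def main(args: Array[String]) {
    // "local[2]": one thread for the socket receiver, one for processing.
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical host and port
    val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()  // prints the first few counts of each batch
    ssc.start()
    ssc.awaitTermination()
  }
}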

Re: Error in SparkSQL Example

2014-03-31 Thread Michael Armbrust
"val people: RDD[Person] // An RDD of case class objects, from the first
example." is just a placeholder to avoid cluttering up each example with
the same code for creating an RDD.  The ": RDD[Person]" is just there to
let you know the expected type of the variable 'people'.  Perhaps there is
a clearer way to indicate this.

As you have realized, using the full line from the first example will allow
you to run the rest of them.



On Sun, Mar 30, 2014 at 7:31 AM, Manoj Samel wrote:

> Hi,
>
> On
> http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html,
> I am trying to run code on "Writing Language-Integrated Relational Queries"
> ( I have 1.0.0 Snapshot ).
>
> I am running into error on
>
> val people: RDD[Person] // An RDD of case class objects, from the first
> example.
>
> scala> val people: RDD[Person]
> <console>:19: error: not found: type RDD
>val people: RDD[Person]
>^
>
> scala> val people: org.apache.spark.rdd.RDD[Person]
> <console>:18: error: class $iwC needs to be abstract, since value people
> is not defined
> class $iwC extends Serializable {
>   ^
>
> Any idea what the issue is ?
>
> Also, its not clear what does the RDD[Person] brings. I can run the DSL
> without the case class objects RDD ...
>
> val people =
> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
> => Person(p(0), p(1).trim.toInt))
>
> val teenagers = people.where('age >= 13).where('age <= 19)
>
> Thanks,
>
>
>
>


Re: SparkSQL "where" with BigDecimal type gives stacktrace

2014-03-31 Thread Michael Armbrust
This was not intentional, here is a JIRA
https://issues.apache.org/jira/browse/SPARK-1364

Note that you can create big decimals by using the Decimal type in a
HiveContext.

Date is not yet a supported data type.


On Sun, Mar 30, 2014 at 5:35 PM, Manoj Samel wrote:

> Hi,
>
> Would the same issue be present for other Java type like Date ?
>
> Converting the person/teenager example on Patricks page reproduces the
> problem ...
>
> Thanks,
>
>
> scala> import scala.math
> import scala.math
>
> scala> case class Person(name: String, age: BigDecimal)
>  defined class Person
>
> scala> val people =
> sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p
> => Person(p(0), BigDecimal(p(1).trim.toInt)))
> 14/03/31 00:23:40 INFO MemoryStore: ensureFreeSpace(32960) called with
> curMem=0, maxMem=308713881
> 14/03/31 00:23:40 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 32.2 KB, free 294.4 MB)
> people: org.apache.spark.rdd.RDD[Person] = MappedRDD[3] at map at
> <console>:20
>
> scala> people take 1
> ...
>
> scala> val t = people.where('age > 12 )
> scala.MatchError: scala.BigDecimal (of class
> scala.reflect.internal.Types$TypeRef$$anon$3)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:41)
>  at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:45)
>  at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>  at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:45)
>  at
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:38)
> at
> org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:32)
>  at
> org.apache.spark.sql.execution.ExistingRdd$.fromProductRdd(basicOperators.scala:128)
> at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:79)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
> at $iwC$$iwC$$iwC.<init>(<console>:31)
> at $iwC$$iwC.<init>(<console>:33)
> at $iwC.<init>(<console>:35)
> at <init>(<console>:37)
> at .<init>(<console>:41)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:601)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:777)
>  at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1045)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
>  at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
>  at
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:795)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:840)
>  at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:752)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:600)
>  at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:607)
> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:610)
>  at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:935)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
>  at
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:883)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>  at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:883)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:981)
>  at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
>
>
>
> On Sun, Mar 30, 2014 at 11:04 AM, Aaron Davidson wrote:
>
>> Well, the error is coming from this case statement not matching on the
>> BigDecimal type:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L41
>>
>> This seems to be a bug because there is a corresponding Catalyst DataType
>> for BigDecimal, just no way to produce a schema for it. A patch should be
>> straightforward enough to match against typeOf[BigDecimal] assuming this
>> was not for some reason intentional.
>>
>>
>> On Sun, Mar 30, 2014 at 10:43 AM, smallmonkey...@hotmail.com <
>> smallmonkey...@hotmail.com> wrote:
>>
>>>  can

Re: groupBy RDD does not have grouping column ?

2014-03-31 Thread Michael Armbrust
This is similar to how SQL works, items in the GROUP BY clause are not
included in the output by default.  You will need to include 'a in the
second parameter list (which is similar to the SELECT clause) as well if
you want it included in the output.
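
Continuing the snippet from the question, a minimal sketch with the grouping key
added to the second (SELECT-like) parameter list (the SchemaRDD variable name and
the surrounding imports are assumptions):

// 'a is listed again in the second parameter list, so the result
// now has the columns 'a, 'foo and 'bar.
val grouped = rdd.groupBy('a)('a, Sum('b) as 'foo, Sum('c) as 'bar)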


On Sun, Mar 30, 2014 at 9:52 PM, Manoj Samel wrote:

> Hi,
>
> If I create a groupBy('a)(Sum('b) as 'foo, Sum('c) as 'bar), then the
> resulting RDD should have 'a, 'foo and 'bar.
>
> The result RDD just shows 'foo and 'bar and is missing 'a
>
> Thoughts?
>
> Thanks,
>
> Manoj
>


Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust
>
> * unionAll preserve duplicate v/s union that does not
>

This is true; if you want to eliminate duplicate items, you should follow
the union with a distinct()
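
A minimal sketch of that, where rdd1 and rdd2 are assumed to be SchemaRDDs with
identical schemas:

val withDuplicates = rdd1.unionAll(rdd2)      // keeps duplicate rows
val deduplicated = withDuplicates.distinct()  // SQL UNION semantics: duplicates removed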


> * SQL union and unionAll result in same output format i.e. another SQL v/s
> different RDD types here.
>
* Understand the existing union contract issue. This may be a class
> hierarchy discussion for SchemaRDD, UnionRDD etc. ?
>

This is unfortunately going to be a limitation of the query DSL since it
extends standard RDDs.  It is not possible for us to return specialized
types from functions that are already defined in RDD (such as union) as the
base RDD class has a very opaque notion of schema, and at this point the
API for RDDs is very fixed.  If you use SQL however, you will always get
back SchemaRDDs.


Best practices: Parallelized write to / read from S3

2014-03-31 Thread Nicholas Chammas
Howdy-doody,

I have a single, very large file sitting in S3 that I want to read in with
sc.textFile(). What are the best practices for reading in this file as
quickly as possible? How do I parallelize the read as much as possible?

Similarly, say I have a single, very large RDD sitting in memory that I
want to write out to S3 with RDD.saveAsTextFile(). What are the best
practices for writing this file out as quickly as possible?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-Parallelized-write-to-read-from-S3-tp3516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I tried 0.9.0 and the latest git tree of spark. For mesos, I tried 0.17.0 and 
the latest git tree.

Thanks


On 31-Mar-2014, at 7:24 pm, Tim St Clair  wrote:

> What versions are you running?  
> 
> There is a known protobuf 2.5 mismatch, depending on your versions. 
> 
> Cheers,
> Tim
> 
> - Original Message -
>> From: "Bharath Bhushan" 
>> To: user@spark.apache.org
>> Sent: Monday, March 31, 2014 8:16:19 AM
>> Subject: java.lang.ClassNotFoundException - spark on mesos
>> 
>> I am facing different kinds of java.lang.ClassNotFoundException when trying
>> to run spark on mesos. One error has to do with
>> org.apache.spark.executor.MesosExecutorBackend. Another has to do with
>> org.apache.spark.serializer.JavaSerializer. I see other people complaining
>> about similar issues.
>> 
>> I tried with different versions of the spark distribution - 0.9.0 and
>> 1.0.0-SNAPSHOT - and faced the same problem. I think the reason for this is
>> related to the error below.
>> 
>> $ jar -xf spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
>> java.io.IOException: META-INF/license : could not create directory
>>at sun.tools.jar.Main.extractFile(Main.java:907)
>>at sun.tools.jar.Main.extract(Main.java:850)
>>at sun.tools.jar.Main.run(Main.java:240)
>>at sun.tools.jar.Main.main(Main.java:1147)
>> 
>> This error happens with all the jars that I created, but the set of classes
>> already extracted differs from case to case. If JavaSerializer is
>> not already extracted before encountering META-INF/license, then that class
>> is not found during execution. If MesosExecutorBackend is not found, then
>> that class shows up in the mesos slave error logs. Can someone confirm if
>> this is a valid cause for the problem I am seeing? Any way I can debug this
>> further?
>> 
>> — Bharath
> 
> -- 
> Cheers,
> Tim
> Freedom, Features, Friends, First -> Fedora
> https://fedoraproject.org/wiki/SIGs/bigdata



yarn.application.classpath in yarn-site.xml

2014-03-31 Thread Dan
Hi,

I've just tested Spark in YARN mode, but something confused me.

When I *delete* the "yarn.application.classpath" configuration in
yarn-site.xml, the following command works well.
*bin/spark-class org.apache.spark.deploy.yarn.Client --jar
examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar
--class org.apache.spark.examples.SparkPi --args yarn-standalone
--num-worker 3*

However, when I configure it as follows, yarnAppState always stays in
the *ACCEPTED* state, and the application shows no sign of stopping.

<property>
  <name>yarn.application.classpath</name>
  <value>
    $HADOOP_HOME/etc/hadoop/conf,
    $HADOOP_HOME/share/hadoop/common/*,$HADOOP_HOME/share/hadoop/common/lib/*,
    $HADOOP_HOME/share/hadoop/hdfs/*,$HADOOP_HOME/share/hadoop/hdfs/lib/*,
    $HADOOP_HOME/share/hadoop/mapreduce/*,$HADOOP_HOME/share/hadoop/mapreduce/lib/*,
    $HADOOP_HOME/share/hadoop/yarn/*,$HADOOP_HOME/share/hadoop/yarn/lib/*
  </value>
</property>



Hadoop version is 2.2.0 and the cluster has one master and three workers.

Does anyone have ideas about this problem?

Thanks,
Dan




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-application-classpath-in-yarn-site-xml-tp3512.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Tim St Clair
What versions are you running?  

There is a known protobuf 2.5 mismatch, depending on your versions. 

Cheers,
Tim

- Original Message -
> From: "Bharath Bhushan" 
> To: user@spark.apache.org
> Sent: Monday, March 31, 2014 8:16:19 AM
> Subject: java.lang.ClassNotFoundException - spark on mesos
> 
> I am facing different kinds of java.lang.ClassNotFoundException when trying
> to run spark on mesos. One error has to do with
> org.apache.spark.executor.MesosExecutorBackend. Another has to do with
> org.apache.spark.serializer.JavaSerializer. I see other people complaining
> about similar issues.
> 
> I tried with different versions of the spark distribution - 0.9.0 and
> 1.0.0-SNAPSHOT - and faced the same problem. I think the reason for this is
> related to the error below.
> 
> $ jar -xf spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
> java.io.IOException: META-INF/license : could not create directory
> at sun.tools.jar.Main.extractFile(Main.java:907)
> at sun.tools.jar.Main.extract(Main.java:850)
> at sun.tools.jar.Main.run(Main.java:240)
> at sun.tools.jar.Main.main(Main.java:1147)
> 
> This error happens with all the jars that I created, but the set of classes
> already extracted differs from case to case. If JavaSerializer is
> not already extracted before encountering META-INF/license, then that class
> is not found during execution. If MesosExecutorBackend is not found, then
> that class shows up in the mesos slave error logs. Can someone confirm if
> this is a valid cause for the problem I am seeing? Any way I can debug this
> further?
> 
> — Bharath

-- 
Cheers,
Tim
Freedom, Features, Friends, First -> Fedora
https://fedoraproject.org/wiki/SIGs/bigdata


java.lang.ClassNotFoundException - spark on mesos

2014-03-31 Thread Bharath Bhushan
I am facing different kinds of java.lang.ClassNotFoundException when trying to 
run spark on mesos. One error has to do with 
org.apache.spark.executor.MesosExecutorBackend. Another has to do with 
org.apache.spark.serializer.JavaSerializer. I see other people complaining 
about similar issues.

I tried with different versions of the spark distribution - 0.9.0 and 1.0.0-SNAPSHOT -
and faced the same problem. I think the reason for this is related to the
error below.

$ jar -xf spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
java.io.IOException: META-INF/license : could not create directory
at sun.tools.jar.Main.extractFile(Main.java:907)
at sun.tools.jar.Main.extract(Main.java:850)
at sun.tools.jar.Main.run(Main.java:240)
at sun.tools.jar.Main.main(Main.java:1147)

This error happens with all the jars that I created, but the set of classes
already extracted differs from case to case. If JavaSerializer is not
already extracted before encountering META-INF/license, then that class is not 
found during execution. If MesosExecutorBackend is not found, then that class 
shows up in the mesos slave error logs. Can someone confirm if this is a valid 
cause for the problem I am seeing? Any way I can debug this further?

— Bharath

Re: SequenceFileRDDFunctions cannot be used output of spark package

2014-03-31 Thread pradeeps8
Hi Sonal,

There are no custom objects in saveRDD; it is of type RDD[(String, String)].

Thanks,
Pradeep 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SequenceFileRDDFunctions-cannot-be-used-output-of-spark-package-tp250p3508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
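
For an RDD[(String, String)] like the saveRDD described above, a minimal sketch of
writing it out as a sequence file (the local master and output path are placeholders):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // Writable conversions and sequence-file implicits

val sc = new SparkContext("local", "seqfile-sketch")
val saveRDD = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))  // stand-in for the real data
saveRDD.saveAsSequenceFile("/tmp/seqfile-out")                 // hypothetical output path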


Re: Task not serializable?

2014-03-31 Thread Daniel Liu
Hi

I am new to Spark, and I encountered this error when I try to map an RDD[A] to
an RDD[Array[Double]] and then collect the results.

A is a custom class that extends Serializable. (Actually, it's just a wrapper
class around a few variables that are all serializable.)

I also tried KryoSerializer according to this guide
http://spark.apache.org/docs/0.8.1/tuning.html and it gave the same error
message.

Daniel Liu