Re: module not found: org.eclipse.paho#mqtt-client;0.4.0

2014-04-04 Thread Sean Owen
This ultimately means a problem with SSL in the version of Java you
are using to run SBT. If you look around the internet, you'll see a
bunch of discussion, most of which seems to boil down to reinstall, or
update, Java.
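
In case it helps narrow it down before reinstalling: a quick diagnostic sketch (not a fix) is to count the CA certificates the JVM's default trust store exposes, from a Scala REPL started with the same Java that runs SBT. Zero reproduces exactly the "trustAnchors parameter must be non-empty" error.

import java.security.KeyStore
import javax.net.ssl.{TrustManagerFactory, X509TrustManager}

// Passing null makes the factory load the JVM's default trust store
// (normally $JAVA_HOME/jre/lib/security/cacerts).
val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
tmf.init(null.asInstanceOf[KeyStore])

val caCount = tmf.getTrustManagers.collect {
  case tm: X509TrustManager => tm.getAcceptedIssuers.length
}.sum

// 0 here means an empty or missing cacerts file for this Java install.
println("CA certificates visible to this JVM: " + caCount)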
--
Sean Owen | Director, Data Science | London


On Fri, Apr 4, 2014 at 12:21 PM, Dear all alicksc...@163.com wrote:

 hello, all

  I am new to Spark and Scala.

  Yesterday my Spark build failed, and the message looks like this:

   Can anyone help me: why can't mqtt-client-0.4.0.pom be found? What should
 I do?

  thanks a lot!

 command: sbt/sbt assembly

 [info] Updating
 {file:/Users/alick/spark/spark-0.9.0-incubating/}external-mqtt...

 [info] Resolving org.eclipse.paho#mqtt-client;0.4.0 ...

 [error] Server access Error: java.lang.RuntimeException: Unexpected error:
 java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
 must be non-empty
 url=https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [error] Server access Error: java.lang.RuntimeException: Unexpected error:
 java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
 must be non-empty
 url=https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [error] Server access Error: java.lang.RuntimeException: Unexpected error:
 java.security.InvalidAlgorithmParameterException: the trustAnchors parameter
 must be non-empty
 url=https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [warn] module not found: org.eclipse.paho#mqtt-client;0.4.0

 [warn]  local: tried

 [warn]
 /Users/alick/.ivy2/local/org.eclipse.paho/mqtt-client/0.4.0/ivys/ivy.xml

 [warn]  Local Maven Repo: tried

 [warn]  sonatype-snapshots: tried

 [warn]
 https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [warn]  sonatype-staging: tried

 [warn]
 https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [warn]  Eclipse Repo: tried

 [warn]
 https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom

 [warn]  public: tried

 [warn]
 http://repo1.maven.org/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom









Explain Add Input

2014-04-04 Thread Eduardo Costa Alfaia

Hi all,

Could anyone explain the lines below to me?

computer1 - worker
computer8 - driver(master)

14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 
KB, free: 540.3 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(1292780) called with 
curMem=49555672, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314800 stored 
as bytes to memory (size 1262.5 KB, free 738.7 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314800 in memory on computer8.ant-net:49743 (size: 1262.5 
KB, free: 738.7 MB)


Why does Spark add the same input block on computer8, which is the driver (master)?

Thanks guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


RAM high consume

2014-04-04 Thread Eduardo Costa Alfaia

 Hi all,

I am doing some tests using JavaNetworkWordCount and I have some 
questions about machine performance; my tests run for approximately 2 minutes.


Why does the free RAM decrease so significantly? I have done tests with 2 and 3 
machines and I got the same behavior.


What should I do to get better performance in this case?


# Start Test

computer1
             total   used   free  shared  buffers  cached
Mem:          3945    711   3233       0        3     430
-/+ buffers/cache:     276   3668
Swap:            0      0      0

 14:42:50 up 73 days,  3:32,  2 users,  load average: 0.00, 0.06, 0.21


14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314400 in memory on computer1.ant-net:60820 (size: 826.1 
KB, free: 542.9 MB)
14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(845956) called with 
curMem=47278100, maxMem=825439027
14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314400 stored 
as bytes to memory (size 826.1 KB, free 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614314400 in memory on computer8.ant-net:49743 (size: 826.1 
KB, free: 741.3 MB)
14/04/04 14:24:56 INFO BlockManagerMaster: Updated info of block 
input-0-1396614314400
14/04/04 14:24:56 INFO TaskSetManager: Finished TID 272 in 84 ms on 
computer1.ant-net (progress: 0/1)

14/04/04 14:24:56 INFO TaskSchedulerImpl: Remove TaskSet 43.0 from pool
14/04/04 14:24:56 INFO DAGScheduler: Completed ResultTask(43, 0)
14/04/04 14:24:56 INFO DAGScheduler: Stage 43 (take at 
DStream.scala:594) finished in 0.088 s
14/04/04 14:24:56 INFO SparkContext: Job finished: take at 
DStream.scala:594, took 1.872875734 s

---
Time: 1396614289000 ms
---
(Santiago,1)
(liveliness,1)
(Sun,1)
(reapers,1)
(offer,3)
(BARBER,3)
(shrewdness,1)
(truism,1)
(hits,1)
(merchant,1)



# End Test
computer1
             total   used   free  shared  buffers  cached
Mem:          3945   2209   1735       0        5     773
-/+ buffers/cache:    1430   2514
Swap:            0      0      0

 14:46:05 up 73 days,  3:35,  2 users,  load average: 2.69, 1.07, 0.55


14/04/04 14:26:57 INFO TaskSetManager: Starting task 183.0:0 as TID 696 
on executor 0: computer1.ant-net (PROCESS_LOCAL)
14/04/04 14:26:57 INFO TaskSetManager: Serialized task 183.0:0 as 1981 
bytes in 0 ms
14/04/04 14:26:57 INFO MapOutputTrackerMasterActor: Asked to send map 
output locations for shuffle 81 to sp...@computer1.ant-net:44817
14/04/04 14:26:57 INFO MapOutputTrackerMaster: Size of output statuses 
for shuffle 81 is 212 bytes
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614336600 on disk on computer1.ant-net:60820 (size: 1441.7 KB)
14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added 
input-0-1396614435200 in memory on computer1.ant-net:60820 (size: 1295.7 
KB, free: 589.3 KB)
14/04/04 14:26:57 INFO TaskSetManager: Finished TID 696 in 56 ms on 
computer1.ant-net (progress: 0/1)

14/04/04 14:26:57 INFO TaskSchedulerImpl: Remove TaskSet 183.0 from pool
14/04/04 14:26:57 INFO DAGScheduler: Completed ResultTask(183, 0)
14/04/04 14:26:57 INFO DAGScheduler: Stage 183 (take at 
DStream.scala:594) finished in 0.057 s
14/04/04 14:26:57 INFO SparkContext: Job finished: take at 
DStream.scala:594, took 1.575268894 s

---
Time: 1396614359000 ms
---
(hapless,9)
(reapers,8)
(amazed,113)
(feebleness,7)
(offer,148)
(rabble,27)
(exchanging,7)
(merchant,20)
(incentives,2)
(quarrel,48)
...


Thanks Guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


how to save RDD partitions in different folders?

2014-04-04 Thread dmpour23
Hi all,
Say I have an input file which I would like to partition using
HashPartitioner k times.

Calling  rdd.saveAsTextFile(hdfs://); will save k files as part-0
part-k  
Is there a way to save each partition in specific folders?

i.e. src
  part0/part-0 
  part1/part-1
  part1/part-k

thanks
Dimitri





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Spark 1.0.0 release plan

2014-04-04 Thread Tom Graves
Do we have a list of things we really want to get in for 1.X? Perhaps move 
any JIRAs out to a 1.1 release if we aren't targeting them for 1.0.

 It might be nice to send out reminders when these dates are approaching. 

Tom
On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com wrote:
 
Thanks a lot guys!





On Fri, Apr 4, 2014 at 5:34 AM, Patrick Wendell pwend...@gmail.com wrote:

Btw - after that initial thread I proposed a slightly more detailed set of 
dates:

https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage 


- Patrick



On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hey Bhaskar, this is still the plan, though QAing might take longer than 15 
days. Right now since we’ve passed April 1st, the only features considered for 
a merge are those that had pull requests in review before. (Some big ones are 
things like annotating the public APIs and simplifying configuration). Bug 
fixes and things like adding Python / Java APIs for new components will also 
still be considered.


Matei


On Apr 3, 2014, at 10:30 AM, Bhaskar Dutta bhas...@gmail.com wrote:

Hi,


Is there any change in the release plan for the Spark 1.0.0-rc1 release date 
from what is listed in the "Proposal for Spark Release Strategy" thread?
== Tentative Release Window for 1.0.0 ==
Feb 1st - April 1st: General development
April 1st: Code freeze for new features
April 15th: RC1
Thanks,
Bhaskar



Re: How to create a RPM package

2014-04-04 Thread Christophe Préaud

Hi Rahul,

Spark will be available in Fedora 21 (see: 
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently 
scheduled for 2014-10-14, but they have already produced spec files and source 
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec 
file, which you can probably adapt to your needs.

Christophe.

On 04/04/2014 09:10, Rahul Singhal wrote:
Hello Community,

This is my first mail to the list and I have a small question. The maven build 
page (http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages)
mentions a way to create a Debian package, but I was wondering if there is a simple 
way (preferably through maven) to create an RPM package. Is there a script (which is 
probably used for Spark releases) that I can get my hands on? Or should I write one 
on my own?

P.S. I don't want to use the alien software to convert a debian package to a 
RPM.

Thanks,
Rahul Singhal



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for 
their addressees. If you are not the intended recipient of this message, please 
destroy it and notify the sender.
Name: spark
Version:  0.9.0

# Build time settings
%global _full_version %{version}-incubating
%global _final_name %{name}-%{_full_version}
%global _spark_hadoop_version 2.2.0
%global _spark_dir /opt

Release:  2
Summary:  Lightning-fast cluster computing
Group:Development/Libraries
License:  ASL 2.0
URL:  http://spark.apache.org/
Source0:  http://www.eu.apache.org/dist/incubator/spark/%{_final_name}/%{_final_name}.tgz
BuildRequires: git
Requires:  /bin/bash
Requires:  /bin/sh
Requires:  /usr/bin/env

%description
Apache Spark is a fast and general engine for large-scale data processing.


%prep
%setup -q -n %{_final_name}


%build
SPARK_HADOOP_VERSION=%{_spark_hadoop_version} SPARK_YARN=true ./sbt/sbt assembly
find bin -type f -name '*.cmd' -exec rm -f {} \;


%install
mkdir -p ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/{conf,jars}
echo "Spark %{_full_version} built for Hadoop %{_spark_hadoop_version}" > ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/RELEASE
cp assembly/target/scala*/spark-assembly-%{_full_version}-hadoop%{_spark_hadoop_version}.jar ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/jars/spark-assembly-hadoop.jar
cp conf/*.template ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/conf
cp -r bin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r python ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}
cp -r sbin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}


%files
%defattr(-,root,root,-)
%{_spark_dir}/%{name}

%changelog
* Mon Mar 31 2014 Christophe Préaud christophe.pre...@kelkoo.com 0.9.0-2
- Use description and Summary from Fedora RPM

* Wed Mar 26 2014 Christophe Préaud christophe.pre...@kelkoo.com 0.9.0-1
- first version with changelog :-)


Driver increase memory utilization

2014-04-04 Thread Eduardo Costa Alfaia

   Hi Guys,

Could anyone help me understand this driver behavior when I start the 
JavaNetworkWordCount?


computer8
 16:24:07 up 121 days, 22:21, 12 users,  load average: 0.66, 1.27, 1.55
             total   used   free  shared  buffers  cached
Mem:          5897   4341   1555       0      227    2798
-/+ buffers/cache:    1315   4581
Swap:            0      0      0

in 2 minutes

computer8
 16:23:08 up 121 days, 22:20, 12 users,  load average: 0.80, 1.43, 1.62
             total   used   free  shared  buffers  cached
Mem:          5897   5866     30       0      230    3255
-/+ buffers/cache:    2380   3516
Swap:            0      0      0


Thanks

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Hi All,

I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
already being addressed, but I am having a devil of a time with a spark
0.9.0 client jar for hadoop 2.X. If I go to the site and download:


   - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache
mirror 
http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz
   or direct file
download: http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz

I get a jar with what appears to be hadoop 1.0.4 that fails when using
hadoop 2.3.0. I have tried repeatedly to build the source tree with the
correct options per the documentation but always seemingly ending up with
hadoop 1.0.4.  As far as I can tell the reason that the jar available on
the web site doesn't have the correct hadoop client in it, is because the
build itself is having that problem.

I am about to try to troubleshoot the build but wanted to see if anyone out
there has encountered the same problem and/or if I am just doing something
dumb (!)


Anyone else using hadoop 2.X? How do you get the right client jar if so?

cheers,
Erik

-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Rahul Singhal
Hi Erik,

I am working with TOT branch-0.9 ( 0.9.1) and the following works for me for 
maven build:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package


And from http://spark.apache.org/docs/latest/running-on-yarn.html, for sbt 
build, you could try:

SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly

Thanks,
Rahul Singhal

From: Erik Freed erikjfr...@codecision.com
Reply-To: user@spark.apache.org
Date: Friday 4 April 2014 7:58 PM
To: user@spark.apache.org
Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem

Hi All,

I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps 
already being addressed, but I am having a devil of a time with a spark 0.9.0 
client jar for hadoop 2.X. If I go to the site and download:


  *   Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror 
http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz
 or direct file 
download: http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz

I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 
2.3.0. I have tried repeatedly to build the source tree with the correct 
options per the documentation but always seemingly ending up with hadoop 1.0.4. 
 As far as I can tell the reason that the jar available on the web site doesn't 
have the correct hadoop client in it, is because the build itself is having 
that problem.

I am about to try to troubleshoot the build but wanted to see if anyone out 
there has encountered the same problem and/or if I am just doing something dumb 
(!)


Anyone else using hadoop 2.X? How do you get the right client jar if so?

cheers,
Erik

--
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.commailto:erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Amit Tewari
I believe you got to set following

SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is)
SPARK_YARN=true

then type sbt/sbt assembly

If you are using Maven to compile

mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean
package
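
As a quick sanity check afterwards (just a suggestion, not from the original thread), spark-shell can tell you which Hadoop client actually made it into the assembly on the classpath:

// Prints the Hadoop version the assembly was built against, e.g. "1.0.4" vs "2.2.0".
println(org.apache.hadoop.util.VersionInfo.getVersion)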


Hope this helps

-A


On Fri, Apr 4, 2014 at 7:28 AM, Erik Freed erikjfr...@codecision.com wrote:

 Hi All,

 I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
 already being addressed, but I am having a devil of a time with a spark
 0.9.0 client jar for hadoop 2.X. If I go to the site and download:


- Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror 
 http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz
or direct file 
 download: http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz

 I get a jar with what appears to be hadoop 1.0.4 that fails when using
 hadoop 2.3.0. I have tried repeatedly to build the source tree with the
 correct options per the documentation but always seemingly ending up with
 hadoop 1.0.4.  As far as I can tell the reason that the jar available on
 the web site doesn't have the correct hadoop client in it, is because the
 build itself is having that problem.

 I am about to try to troubleshoot the build but wanted to see if anyone
 out there has encountered the same problem and/or if I am just doing
 something dumb (!)


 Anyone else using hadoop 2.X? How do you get the right client jar if so?

 cheers,
 Erik

 --
 Erik James Freed
 CoDecision Software
 510.859.3360
 erikjfr...@codecision.com

 1480 Olympus Avenue
 Berkeley, CA
 94708

 179 Maria Lane
 Orcas, WA
 98245



Re: how to save RDD partitions in different folders?

2014-04-04 Thread Konstantin Kudryavtsev
Hi Evan,

Could you please provide a code-snippet? Because it not clear for me, in
Hadoop you need to engage addNamedOutput method and I'm in stuck how to use
it from Spark

Thank you,
Konstantin Kudryavtsev


On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote:

 Have a look at MultipleOutputs in the hadoop API. Spark can read and write
 to arbitrary hadoop formats.

  On Apr 4, 2014, at 6:01 AM, dmpour23 dmpou...@gmail.com wrote:
 
  Hi all,
  Say I have an input file which I would like to partition using
  HashPartitioner k times.
 
  Calling  rdd.saveAsTextFile(hdfs://); will save k files as part-0
  part-k
  Is there a way to save each partition in specific folders?
 
  i.e. src
   part0/part-0
   part1/part-1
   part1/part-k
 
  thanks
  Dimitri
 
 
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
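
A rough, untested sketch of the approach Evan describes, here using the old-API MultipleTextOutputFormat (a relative of MultipleOutputs) so that each record lands in a sub-directory named after its key. Note that sc, rdd, k and keyFor below are placeholders, not names from the thread:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

// Writes each (key, value) pair under <output>/<key>/part-NNNNN instead of <output>/part-NNNNN.
class KeyBasedTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the file contents, keep only the value.
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  // Use the key as the sub-directory name.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

rdd.map(line => (keyFor(line), line))          // keyFor is a placeholder for your own key function
   .partitionBy(new HashPartitioner(k))        // k partitions, as in the original question
   .saveAsHadoopFile("hdfs:///output/src", classOf[String], classOf[String],
     classOf[KeyBasedTextOutputFormat])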



Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi,
  Can you explain a little more what's going on? Which one submits a job to the 
yarn cluster that creates an application master and spawns containers for the 
local jobs? I tried yarn-client and submitted to our yarn cluster and it seems 
to work that way.  Shouldn't Client.scala be running within the AppMaster 
instance in this run mode?
  How exactly does yarn-standalone work?

Thanks,
Ron

Sent from my iPhone

 On Apr 3, 2014, at 11:19 AM, Kevin Markey kevin.mar...@oracle.com wrote:
 
 We are now testing precisely what you ask about in our environment.  But 
 Sandy's questions are relevant.  The bigger issue is not Spark vs. Yarn but 
 client vs. standalone and where the client is located on the network 
 relative to the cluster.
 
 The client options that locate the client/master remote from the cluster, 
 while useful for interactive queries, suffer from considerable network 
 traffic overhead as the master schedules and transfers data with the worker 
 nodes on the cluster.  The standalone options locate the master/client on 
 the cluster.  In yarn-standalone, the master is a thread contained by the 
 Yarn Resource Manager.  Lots less traffic, as the master is co-located with 
 the worker nodes on the cluster and its scheduling/data communication has 
 less latency.
 
 In my comparisons between yarn-client and yarn-standalone (so as not to 
 conflate yarn vs Spark), yarn-client computation time is at least double 
 yarn-standalone!  At least for a job with lots of stages and lots of 
 client/worker communication, although rather few collect actions, so it's 
 mainly scheduling that's relevant here.
 
 I'll be posting more information as I have it available.
 
 Kevin
 
 
 On 03/03/2014 03:48 PM, Sandy Ryza wrote:
 Are you running in yarn-standalone mode or yarn-client mode?  Also, what 
 YARN scheduler and what NodeManager heartbeat?  
 
 
 On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote:
 Thanks for the advice Mayur.
 
 I thought I'd report back on the performance difference...  Spark standalone
 mode has executors processing at capacity in under a second :)
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 


Re: Hadoop 2.X Spark Client Jar 0.9.0 problem

2014-04-04 Thread Erik Freed
Thanks all for the update - I have actually built using those options every
which way I can think of so perhaps this is something I am doing about how
I upload the jar to our artifactory repo server. Anyone have a working pom
file for the publish of a spark 0.9 hadoop 2.X publish to a maven repo
server?

cheers,
Erik


On Fri, Apr 4, 2014 at 7:54 AM, Rahul Singhal rahul.sing...@guavus.com wrote:

   Hi Erik,

  I am working with TOT branch-0.9 ( 0.9.1) and the following works for
 me for maven build:

   export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
 mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean
 package


  And from http://spark.apache.org/docs/latest/running-on-yarn.html, for
 sbt build, you could try:

  SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly

   Thanks,
 Rahul Singhal

   From: Erik Freed erikjfr...@codecision.com
 Reply-To: user@spark.apache.org user@spark.apache.org
 Date: Friday 4 April 2014 7:58 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem

   Hi All,

  I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps
 already being addressed, but I am having a devil of a time with a spark
 0.9.0 client jar for hadoop 2.X. If I go to the site and download:


- Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror 
 http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz
or direct file 
 download: http://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz

 I get a jar with what appears to be hadoop 1.0.4 that fails when using
 hadoop 2.3.0. I have tried repeatedly to build the source tree with the
 correct options per the documentation but always seemingly ending up with
 hadoop 1.0.4.  As far as I can tell the reason that the jar available on
 the web site doesn't have the correct hadoop client in it, is because the
 build itself is having that problem.

  I am about to try to troubleshoot the build but wanted to see if anyone
 out there has encountered the same problem and/or if I am just doing
 something dumb (!)


  Anyone else using hadoop 2.X? How do you get the right client jar if so?

  cheers,
 Erik

  --
 Erik James Freed
 CoDecision Software
 510.859.3360
 erikjfr...@codecision.com

 1480 Olympus Avenue
 Berkeley, CA
 94708

 179 Maria Lane
 Orcas, WA
 98245




-- 
Erik James Freed
CoDecision Software
510.859.3360
erikjfr...@codecision.com

1480 Olympus Avenue
Berkeley, CA
94708

179 Maria Lane
Orcas, WA
98245


Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

 Is this parallelism level correct?
 The machine's processor is a dual core.

Thanks

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Regarding Sparkcontext object

2014-04-04 Thread Daniel Siegmann
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote:

 Is it always needed that sparkcontext object be created in Main method of
 class.Is it necessary?Can we create sc object in other class and try to
 use it by passing this object through function and use it?


The Spark context can be initialized wherever you like and passed around
just as any other object. Just don't try to create multiple contexts
against local (without stopping the previous one first), or you may get
ArrayStoreExceptions (I learned that one the hard way).
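
For example (a trivial sketch, names made up):

import org.apache.spark.SparkContext

object Jobs {
  // The context is just a parameter like any other object.
  def wordCount(sc: SparkContext, path: String): Long =
    sc.textFile(path).flatMap(_.split(" ")).count()
}

object Main {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "example")   // can be created anywhere, not only in main
    try {
      println(Jobs.wordCount(sc, args(0)))
    } finally {
      sc.stop()                                     // stop it before creating another context
    }
  }
}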

-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-04 Thread Prasad
Hi Wisely,
Could you please post your pom.xml here.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p3770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RAM Increase

2014-04-04 Thread Eduardo Costa Alfaia

Hi Guys,

Could anyone explain this behavior to me? After 2 minutes of tests:

computer1- worker
computer10 - worker
computer8 - driver(master)

computer1
 18:24:31 up 73 days,  7:14,  1 user,  load average: 3.93, 2.45, 1.14
             total   used   free  shared  buffers  cached
Mem:          3945   3925     19       0       18    1368
-/+ buffers/cache:    2539   1405
Swap:            0      0      0
computer10
 18:22:38 up 44 days, 21:26,  2 users,  load average: 3.05, 2.20, 1.03
             total   used   free  shared  buffers  cached
Mem:          5897   5292    604       0       46    2707
-/+ buffers/cache:    2538   3358
Swap:            0      0      0
computer8
 18:24:13 up 122 days, 22 min, 13 users,  load average: 1.10, 0.93, 0.82
             total   used   free  shared  buffers  cached
Mem:          5897   5841     55       0      113    2747
-/+ buffers/cache:    2980   2916
Swap:            0      0      0

Thanks Guys

--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.


On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it wrote:

 Hi all,

 I have put this line in my spark-env.sh:
 -Dspark.default.parallelism=20

  this parallelism level, is it correct?
  The machine's processor is a dual core.

 Thanks

 --
 Informativa sulla Privacy: http://www.unibs.it/node/8155



Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...

2014-04-04 Thread Marcelo Vanzin
Hi Francis,

This might be a long shot, but do you happen to have built spark on an
encrypted home dir?

(I was running into the same error when I was doing that. Rebuilding
on an unencrypted disk fixed the issue. This is a known issue /
limitation with ecryptfs. It's weird that the build doesn't fail, but
you do get warnings about the long file names.)


On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu francis...@reachjunction.com wrote:
 I stuck in a NoClassDefFoundError.  Any helps that would be appreciated.

 I download spark 0.9.0 source, and then run this command to build it :
 SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly


 java.lang.NoClassDefFoundError:
 scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1

-- 
Marcelo


How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
I'm trying to get a clear idea about how exceptions are handled in Spark.
Is there somewhere I can read about this? I'm on Spark 0.7.

For some reason I was under the impression that such exceptions are
swallowed and the value that produced them ignored but the exception is
logged. However, right now we're seeing the task just re-tried over and
over again in an infinite loop because there's a value that always
generates an exception.

John


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a 
SparkException thrown if a task fails more than 4 times), but there were some 
bugs before where this did not happen for non-Serializable exceptions. We 
changed it to pass back the stack traces only (as text), which should always 
work. I’d recommend trying a newer Spark version, 0.8 should be easy to upgrade 
to from 0.7.

Matei

On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote:

 I'm trying to get a clear idea about how exceptions are handled in Spark? Is 
 there somewhere where I can read about this? I'm on spark .7
 
 For some reason I was under the impression that such exceptions are swallowed 
 and the value that produced them ignored but the exception is logged. 
 However, right now we're seeing the task just re-tried over and over again in 
 an infinite loop because there's a value that always generates an exception.
 
 John



Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
 In such construct, each operator builds on the previous one, including any
 materialized results etc. If I use a SQL for each of them, I suspect the
 later SQLs will not leverage the earlier SQLs by any means - hence these
 will be inefficient to first approach. Let me know if this is not correct.


This is not correct.  When you run a SQL statement and register it as a
table, it is the logical plan for that query that is used when the virtual
table is referenced in later queries, not its results.  SQL queries are
lazy, just like RDDs and DSL queries.  This is illustrated below.


scala> sql("SELECT * FROM selectQuery")
res3: org.apache.spark.sql.SchemaRDD =
SchemaRDD[12] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

scala> sql("SELECT * FROM src").registerAsTable("selectQuery")

scala> sql("SELECT key FROM selectQuery")
res5: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24] at RDD at SchemaRDD.scala:93
== Query Plan ==
HiveTableScan [key#8], (MetastoreRelation default, src, None), None

Even though the second query is running over the results of the first
query (which requested all columns using *), the optimizer is still able to
come up with an efficient plan that avoids reading value from the table,
as can be seen by the arguments of the HiveTableScan.

Note that if you call sqlContext.cacheTable("selectQuery") then you are
correct.  The results will be materialized in an in-memory columnar format,
and subsequent queries will be run over these materialized results.


 The reason for building expressions is that the use case needs these to be
 created on the fly based on some case class at runtime.

 I.e., I can't type these in REPL. The scala code will define some case
 class A (a: ... , b: ..., c: ... ) where class name, member names and types
 will be known before hand and the RDD will be defined on this. Then based
 on user action, above pipeline needs to be constructed on fly. Thus the
 expressions has to be constructed on fly from class members and other
 predicates etc., most probably using expression constructors.

 Could you please share how expressions could be constructed using the APIs
 on expression (and not on REPL) ?


I'm not sure I completely understand the use case here, but you should be
able to construct symbols and use the DSL to create expressions at runtime,
just like in the REPL.

val attrName: String = "name"
val addExpression: Expression = Symbol(attrName) + Symbol(attrName)

There is currently no public API for constructing expressions manually
other than SQL or the DSL.  While you could dig into
org.apache.spark.sql.catalyst.expressions._, these APIs are considered
internal, and *will not be stable in between versions*.

Michael


Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

What do you advice me Nicholas?

Em 4/4/14, 19:05, Nicholas Chammas escreveu:
If you're running on one machine with 2 cores, I believe all you can 
get out of it are 2 concurrent tasks at any one time. So setting your 
default parallelism to 20 won't help.



On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it mailto:e.costaalf...@unibs.it wrote:


Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

 this parallelism level, is it correct?
 The machine's processor is a dual core.

Thanks

-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155






--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Is there a way to log exceptions inside a mapping function? logError and
logInfo seem to freeze things.


On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Exceptions should be sent back to the driver program and logged there
 (with a SparkException thrown if a task fails more than 4 times), but there
 were some bugs before where this did not happen for non-Serializable
 exceptions. We changed it to pass back the stack traces only (as text),
 which should always work. I'd recommend trying a newer Spark version, 0.8
 should be easy to upgrade to from 0.7.

 Matei

 On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote:

  I'm trying to get a clear idea about how exceptions are handled in
 Spark? Is there somewhere where I can read about this? I'm on spark .7
 
  For some reason I was under the impression that such exceptions are
 swallowed and the value that produced them ignored but the exception is
 logged. However, right now we're seeing the task just re-tried over and
 over again in an infinite loop because there's a value that always
 generates an exception.
 
  John




Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread John Salvatier
Btw, thank you for your help.


On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.com wrote:

 Is there a way to log exceptions inside a mapping function? logError and
 logInfo seem to freeze things.


 On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Exceptions should be sent back to the driver program and logged there
 (with a SparkException thrown if a task fails more than 4 times), but there
 were some bugs before where this did not happen for non-Serializable
 exceptions. We changed it to pass back the stack traces only (as text),
 which should always work. I'd recommend trying a newer Spark version, 0.8
 should be easy to upgrade to from 0.7.

 Matei

 On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote:

  I'm trying to get a clear idea about how exceptions are handled in
 Spark? Is there somewhere where I can read about this? I'm on spark .7
 
  For some reason I was under the impression that such exceptions are
 swallowed and the value that produced them ignored but the exception is
 logged. However, right now we're seeing the task just re-tried over and
 over again in an infinite loop because there's a value that always
 generates an exception.
 
  John





Re: Example of creating expressions for SchemaRDD methods

2014-04-04 Thread Michael Armbrust
Minor typo in the example.  The first SELECT statement should actually be:

sql("SELECT * FROM src")

Where `src` is a HiveTable with schema (key INT value STRING).


On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.com wrote:


 In such construct, each operator builds on the previous one, including any
 materialized results etc. If I use a SQL for each of them, I suspect the
 later SQLs will not leverage the earlier SQLs by any means - hence these
 will be inefficient to first approach. Let me know if this is not correct.


 This is not correct.  When you run a SQL statement and register it as a
 table, it is the logical plan for this query is used when this virtual
 table is referenced in later queries, not the results.  SQL queries are
 lazy, just like RDDs and DSL queries.  This is illustrated below.


 scala sql(SELECT * FROM selectQuery)
 res3: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[12] at RDD at SchemaRDD.scala:93
 == Query Plan ==
 HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None

 scala sql(SELECT * FROM src).registerAsTable(selectQuery)

 scala sql(SELECT key FROM selectQuery)
 res5: org.apache.spark.sql.SchemaRDD =
 SchemaRDD[24] at RDD at SchemaRDD.scala:93
 == Query Plan ==
 HiveTableScan [key#8], (MetastoreRelation default, src, None), None

 Even though the second query is running over the results of the first
 query (which requested all columns using *), the optimizer is still able to
 come up with an efficient plan that avoids reading value from the table,
 as can be seen by the arguments of the HiveTableScan.

 Note that if you call sqlContext.cacheTable(selectQuery) then you are
 correct.  The results will be materialized in an in-memory columnar format,
 and subsequent queries will be run over these materialized results.


 The reason for building expressions is that the use case needs these to
 be created on the fly based on some case class at runtime.

 I.e., I can't type these in REPL. The scala code will define some case
 class A (a: ... , b: ..., c: ... ) where class name, member names and types
 will be known before hand and the RDD will be defined on this. Then based
 on user action, above pipeline needs to be constructed on fly. Thus the
 expressions has to be constructed on fly from class members and other
 predicates etc., most probably using expression constructors.

 Could you please share how expressions could be constructed using the
 APIs on expression (and not on REPL) ?


 I'm not sure I completely understand the use case here, but you should be
 able to construct symbols and use the DSL to create expressions at runtime,
 just like in the REPL.

 val attrName: String = name
 val addExpression: Expression = Symbol(attrName) + Symbol(attrName)

 There is currently no public API for constructing expressions manually
 other than SQL or the DSL.  While you could dig into
 org.apache.spark.sql.catalyst.expressions._, these APIs are considered
 internal, and *will not be stable in between versions*.

 Michael






Largest Spark Cluster

2014-04-04 Thread Parviz Deyhim
Spark community,


What's the size of the largest Spark cluster ever deployed? I've heard
Yahoo is running Spark on several hundred nodes but don't know the actual
number.

can someone share?

Thanks


Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with
more cores, or use a cluster of machines.
spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the
easiest way to do this.

If you're stuck on a single machine with 2 cores, then set your default
parallelism to 2. Setting it to a higher number won't do anything helpful.
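
For reference, the same setting can also be supplied in code rather than spark-env.sh (a sketch; SparkConf is available from Spark 0.9 on, and individual shuffle operations can override the default explicitly):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Match the default parallelism to the cores you actually have.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("wordcount")
  .set("spark.default.parallelism", "2")
val sc = new SparkContext(conf)

val counts = sc.textFile("input.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 2)   // second argument = number of reduce partitions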


On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it
 wrote:

  What do you advice me Nicholas?

 Em 4/4/14, 19:05, Nicholas Chammas escreveu:

 If you're running on one machine with 2 cores, I believe all you can get
 out of it are 2 concurrent tasks at any one time. So setting your default
 parallelism to 20 won't help.


 On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia 
 e.costaalf...@unibs.it wrote:

 Hi all,

 I have put this line in my spark-env.sh:
 -Dspark.default.parallelism=20

  this parallelism level, is it correct?
  The machine's processor is a dual core.

 Thanks

 --
 Informativa sulla Privacy: http://www.unibs.it/node/8155




 Informativa sulla Privacy: http://www.unibs.it/node/8155



Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
FYI, one thing we’ve added now is support for reading multiple text files from 
a directory as separate records: https://github.com/apache/spark/pull/327. This 
should remove the need for mapPartitions discussed here.
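
In case it helps, a small sketch of what that looks like (method name per the pull request; the XML handling and paths are just an illustration, and sc is assumed to be a SparkContext):

// Each element is (fileName, entireFileContents), so a whole XML file is one record.
val files = sc.wholeTextFiles("hdfs:///data/countries")

val countryNames = files.flatMap { case (path, content) =>
  val root = scala.xml.XML.loadString(content)
  (root \\ "country").map(node => (node \ "@name").text)
}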

Avro and SequenceFiles look like they may not make it for 1.0, but there’s a 
chance that Parquet support with Spark SQL will, which should let you store 
binary data a bit better.

Matei

On Mar 19, 2014, at 3:12 PM, Jeremy Freeman freeman.jer...@gmail.com wrote:

 Another vote on this, support for simple SequenceFiles and/or Avro would be 
 terrific, as using plain text can be very space-inefficient, especially for 
 numerical data.
 
 -- Jeremy
 
 On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 
 I'd second the request for Avro support in Python first, followed by Parquet.
 
 
 On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin itparan...@gmail.com 
 wrote:
 
 On 19 Mar 2014, at 19:54, Diana Carroll dcarr...@cloudera.com wrote:
 
 Actually, thinking more on this question, Matei: I'd definitely say support 
 for Avro.  There's a lot of interest in this!!
 
 
 Agree, and parquet as default Cloudera Impala format.
 
 
 
 
 On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com 
 wrote:
 BTW one other thing — in your experience, Diana, which non-text 
 InputFormats would be most useful to support in Python first? Would it be 
 Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or 
 something else? I think a per-file text input format that does the stuff we 
 did here would also be good.
 
 Matei
 
 
 On Mar 18, 2014, at 3:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Hi Diana,
 
 This seems to work without the iter() in front if you just return 
 treeiterator. What happened when you didn’t include that? Treeiterator 
 should return an iterator.
 
 Anyway, this is a good example of mapPartitions. It’s one where you want 
 to view the whole file as one object (one XML here), so you couldn’t 
 implement this using a flatMap, but you still want to return multiple 
 values. The MLlib example you saw needs Python 2.7 because unfortunately 
 that is a requirement for our Python MLlib support (see 
 http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries).
  We’d like to relax this later but we’re using some newer features of 
 NumPy and Python. The rest of PySpark works on 2.6.
 
 In terms of the size in memory, here both the string s and the XML tree 
 constructed from it need to fit in, so you can’t work on very large 
 individual XML files. You may be able to use a streaming XML parser 
 instead to extract elements from the data in a streaming fashion, without 
 ever materializing the whole tree. 
 http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader
  is one example.
 
 Matei
 
 On Mar 18, 2014, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote:
 
 Well, if anyone is still following this, I've gotten the following code 
 working which in theory should allow me to parse whole XML files: (the 
 problem was that I can't return the tree iterator directly.  I have to 
 call iter().  Why?)
 
 import xml.etree.ElementTree as ET
 
 # two source files, format: <data><country name=...> ... </country></data>
 mydata = sc.textFile("file:/home/training/countries*.xml")
 
 def parsefile(iterator):
     s = ''
     for i in iterator: s = s + str(i)
     tree = ET.fromstring(s)
     treeiterator = tree.getiterator("country")
     # why do I have to convert an iterator to an iterator?  not sure but required
     return iter(treeiterator)
 
 mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element: element.attrib).collect()
 
 The output is what I expect:
 [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}]
 
 BUT I'm a bit concerned about the construction of the string s.  How 
 big can my file be before converting it to a string becomes problematic?
 
 
 
 On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com 
 wrote:
 Thanks, Matei.
 
 In the context of this discussion, it would seem mapParitions is 
 essential, because it's the only way I'm going to be able to process each 
 file as a whole, in our example of a large number of small XML files 
 which need to be parsed as a whole file because records are not required 
 to be on a single line.
 
 The theory makes sense but I'm still utterly lost as to how to implement 
 it.  Unfortunately there's only a single example of the use of 
 mapPartitions in any of the Python example programs, which is the log 
 regression example, which I can't run because it requires Python 2.7 and 
 I'm on Python 2.6.  (aside: I couldn't find any statement that Python 2.6 
 is unsupported...is it?)
 
 I'd really really love to see a real life example of a Python use of 
 mapPartitions.  I do appreciate the very simple examples you provided, 
 but (perhaps because of my novice status on Python) I can't figure out 
 

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as 
long as the cluster is stopped. If the cluster is stopped, just go into the AWS 
Console, right click a slave and choose “launch more of these” to add more. Or 
select multiple slaves and delete them. When you run spark-ec2 start the next 
time to start your cluster, it will set it up on all the machines it finds in 
the mycluster-slaves security group.

This is pretty hacky so it would definitely be good to add this feature; feel 
free to open a JIRA about it.

Matei

On Apr 4, 2014, at 12:16 PM, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

 I would like to be able to use spark-ec2 to launch new slaves and add them to 
 an existing, running cluster. Similarly, I would also like to remove slaves 
 from an existing cluster.
 
 Use cases include:
 Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves.
 During scheduled batch processing, I want to add some new slaves, perhaps on 
 spot instances. When that processing is done, I want to kill them. (Cruel, I 
 know.)
 I gather this is not possible at the moment. spark-ec2 appears to be able to 
 launch new slaves for an existing cluster only if the master is stopped. I 
 also do not see any ability to remove slaves from a cluster.
 
 Is that correct? Are there plans to add such functionality to spark-ec2 in 
 the future?
 
 Nick
 
 
 View this message in context: Having spark-ec2 join new slaves to existing 
 cluster
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How to create a RPM package

2014-04-04 Thread Rahul Singhal
Hi Christophe,

Thanks for your reply and the spec file. I have solved my issue for now. I 
didn't want to rely on building Spark via the spec file (%build section), as I 
don't want to be maintaining the list of files that need to be packaged. I 
ended up adding maven build support to make-distribution.sh. This script 
produces a tar ball which I can then use to create an RPM package.

Thanks,
Rahul Singhal

From: Christophe Préaud christophe.pre...@kelkoo.com
Reply-To: user@spark.apache.org
Date: Friday 4 April 2014 7:55 PM
To: user@spark.apache.org
Subject: Re: How to create a RPM package

Hi Rahul,

Spark will be available in Fedora 21 (see: 
https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently 
scheduled on 2014-10-14 but they already have produced spec files and source 
RPMs.
If you are stuck with EL6 like me, you can have a look at the attached spec 
file, which you can probably adapt to your need.

Christophe.

On 04/04/2014 09:10, Rahul Singhal wrote:
Hello Community,

This is my first mail to the list and I have a small question. The maven build 
page (http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages)
 mentions a way to create a Debian package, but I was wondering if there is a 
simple way (preferably through maven) to create a RPM package. Is there a 
script (which is probably used for spark releases) that I can get my hands on? 
Or should I write one on my own?

P.S. I don't want to use the alien software to convert a debian package to a 
RPM.

Thanks,
Rahul Singhal



Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively for 
their addressees. If you are not the intended recipient of this message, please 
destroy it and notify the sender.


Re: reduceByKeyAndWindow Java

2014-04-04 Thread Eduardo Costa Alfaia

Hi Tathagata,

You are right, this code compiles, but I am having some problems with high 
memory consumption; I sent some emails about this today, but no response 
so far.


Thanks
Em 4/4/14, 22:56, Tathagata Das escreveu:
I haven't really compiled the code, but it looks good to me. Why? Is 
there any problem you are facing?


TD


On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it mailto:e.costaalf...@unibs.it wrote:



Hi guys,

I would like to know if this piece of code is right for using a window.

JavaPairDStream<String, Integer> wordCounts = words.map(
  new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }).reduceByKeyAndWindow(
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 + i2; }
    },
    new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer i1, Integer i2) { return i1 - i2; }
    },
    new Duration(60 * 5 * 1000),
    new Duration(1 * 1000)
  );



Thanks

-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155






--
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Andrew Or
Logging inside a map function shouldn't freeze things. The messages
should end up in the worker logs, since the code is executed on the
executors. If you throw a SparkException, however, it'll be propagated to
the driver after the task has failed 4 or more times (by default).
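
A sketch of one way to do that, assuming log4j (which Spark already uses) is on the classpath; parse below is a placeholder for your own function:

import org.apache.log4j.Logger

val parsed = rdd.mapPartitions { records =>
  // Created inside the task, so messages go to the executor (worker) logs.
  val log = Logger.getLogger("myapp.parsing")
  records.flatMap { record =>
    try {
      Some(parse(record))               // parse is a placeholder
    } catch {
      case e: Exception =>
        log.warn("Skipping bad record: " + record, e)
        None                            // drop the bad value instead of failing the task repeatedly
    }
  }
}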

On Fri, Apr 4, 2014 at 11:57 AM, John Salvatier jsalvat...@gmail.comwrote:

 Btw, thank you for your help.


 On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.comwrote:

 Is there a way to log exceptions inside a mapping function? logError and
 logInfo seem to freeze things.


 On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia 
 matei.zaha...@gmail.comwrote:

 Exceptions should be sent back to the driver program and logged there
 (with a SparkException thrown if a task fails more than 4 times), but there
 were some bugs before where this did not happen for non-Serializable
 exceptions. We changed it to pass back the stack traces only (as text),
 which should always work. I'd recommend trying a newer Spark version, 0.8
 should be easy to upgrade to from 0.7.

 Matei

 On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com
 wrote:

  I'm trying to get a clear idea about how exceptions are handled in
 Spark? Is there somewhere where I can read about this? I'm on spark .7
 
  For some reason I was under the impression that such exceptions are
 swallowed and the value that produced them ignored but the exception is
 logged. However, right now we're seeing the task just re-tried over and
 over again in an infinite loop because there's a value that always
 generates an exception.
 
  John






Re: Spark output compression on HDFS

2014-04-04 Thread Azuryy
There is no compress type for snappy.
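
If you need RECORD vs BLOCK compression for SequenceFiles, that is a Hadoop-level setting rather than something saveAsSequenceFile exposes. An untested sketch going through saveAsHadoopFile with a JobConf (property names are the standard Hadoop ones; counts is assumed to be an RDD[(IntWritable, Text)] and sc a SparkContext):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
import org.apache.spark.SparkContext._

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setBoolean("mapred.output.compress", true)
jobConf.setClass("mapred.output.compression.codec",
  classOf[SnappyCodec], classOf[org.apache.hadoop.io.compress.CompressionCodec])
SequenceFileOutputFormat.setOutputCompressionType(jobConf, CompressionType.BLOCK) // or RECORD

counts.saveAsHadoopFile("output", classOf[IntWritable], classOf[Text],
  classOf[SequenceFileOutputFormat[IntWritable, Text]], jobConf)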


Sent from my iPhone5s

 On 4 Apr 2014, at 23:06, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 
 Can anybody suggest how to change compression level (Record, Block) for 
 Snappy? 
 if it possible, of course
 
 thank you in advance
 
 Thank you,
 Konstantin Kudryavtsev
 
 
 On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 Thanks all, it works fine now and I managed to compress output. However, I 
 am still in stuck... How is it possible to set compression type for Snappy? 
 I mean to set up record or block level of compression for output
 
 On Apr 3, 2014 1:15 AM, Nicholas Chammas nicholas.cham...@gmail.com 
 wrote:
 Thanks for pointing that out.
 
 
 On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com 
 wrote:
 First, you shouldn't be using spark.incubator.apache.org anymore, just 
 spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in 
 the Python API at this point. 
 
 
 On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:
 Is this a Scala-only feature?
 
 
 On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com 
 wrote:
 For textFile I believe we overload it and let you set a codec directly:
 
 https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
 
 For saveAsSequenceFile yep, I think Mark is right, you need an option.
 
 
 On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com 
 wrote:
 http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
 
 The signature is 'def saveAsSequenceFile(path: String, codec: 
 Option[Class[_ <: CompressionCodec]] = None)', but you are providing a 
 Class, not an Option[Class].  
 
 Try counts.saveAsSequenceFile("output", 
 Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
 
 
 
 On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev 
 kudryavtsev.konstan...@gmail.com wrote:
 Hi there,
 
 
 
 I've started using Spark recently and evaluating possible use cases in 
 our company. 
 
 I'm trying to save RDD as compressed Sequence file. I'm able to save 
 non-compressed file be calling:
 
 
 
 
 
 counts.saveAsSequenceFile("output")
 where counts is my RDD (IntWritable, Text). However, I didn't manage 
 to compress output. I tried several configurations and always got 
 exception:
 
 counts.saveAsSequenceFile("output", 
 classOf[org.apache.hadoop.io.compress.SnappyCodec])
 <console>:21: error: type mismatch;
  found   : 
 Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile("output", 
 classOf[org.apache.hadoop.io.compress.SnappyCodec])
 
  counts.saveAsSequenceFile("output", 
 classOf[org.apache.spark.io.SnappyCompressionCodec])
 <console>:21: error: type mismatch;
  found   : 
 Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile("output", 
 classOf[org.apache.spark.io.SnappyCompressionCodec])
 and it doesn't work even for Gzip:
 
  counts.saveAsSequenceFile("output", 
 classOf[org.apache.hadoop.io.compress.GzipCodec])
 <console>:21: error: type mismatch;
  found   : 
 Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
  required: Option[Class[_ <: 
 org.apache.hadoop.io.compress.CompressionCodec]]
   counts.saveAsSequenceFile("output", 
 classOf[org.apache.hadoop.io.compress.GzipCodec])
 Could you please suggest solution? also, I didn't find how is it 
 possible to specify compression parameters (i.e. compression type for 
 Snappy). I wondered if you could share code snippets for 
 writing/reading RDD with compression? 
 
 Thank you in advance,
 
 Konstantin Kudryavtsev
 


Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
All

Are there any drawbacks or technical challenges (or any information, really) 
related to using Spark directly on a global parallel filesystem  like 
Lustre/GPFS?

Any idea of what would be involved in doing a minimal proof of concept? Is it 
just possible to run Spark unmodified (without the HDFS substrate) for a start, 
or will that not work at all? I do know that it’s possible to implement Tachyon 
on Lustre and get the HDFS interface – just looking at other options.

Venkat


Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should 
be able to just run Spark and use a file:// URL for your files.

The only downside with running it this way is that Lustre won’t expose data 
locality info to Spark, the way HDFS does. That may not matter if it’s a 
network-mounted file system though.
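
For example, a minimal sketch assuming the parallel filesystem is mounted at
/mnt/lustre (an example path) on the driver and every worker, and a standalone
master at spark://master:7077 (also an example):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext("spark://master:7077", "lustre-wordcount")

// Every node resolves the same file:// path through the shared mount, so no
// HDFS (or other distributed storage layer) is involved.
val lines = sc.textFile("file:///mnt/lustre/data/input.txt")
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("file:///mnt/lustre/data/wordcounts")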

Matei

On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote:

 All
 
 Are there any drawbacks or technical challenges (or any information, really) 
 related to using Spark directly on a global parallel filesystem  like 
 Lustre/GPFS? 
 
 Any idea of what would be involved in doing a minimal proof of concept? Is it 
 just possible to run Spark unmodified (without the HDFS substrate) for a 
 start, or will that not work at all? I do know that it’s possible to 
 implement Tachyon on Lustre and get the HDFS interface – just looking at 
 other options.
 
 Venkat



Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks, will take a look...

Sent from my iPad

 On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu 
 wrote:
 
 We use avro objects in our project, and have a Kryo serializer for generic 
 Avro SpecificRecords. Take a look at:
 
 https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala
 
 Also, Matt Massie has a good blog post about this at 
 http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
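
For anyone who wants the shape of it without following the links, here is a
minimal sketch of such a registrator. It is not the exact ADAM code; MyRecord
is a placeholder for an Avro-generated SpecificRecord class, the Kryo classes
ship with Spark, and the Avro ones come with the avro dependency:

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}
import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
import org.apache.spark.serializer.KryoRegistrator

// Serializes any Avro SpecificRecord through Avro's binary encoding, framing
// each record with its length so Kryo can read it back.
class AvroSpecificSerializer[T <: SpecificRecord] extends Serializer[T] {
  override def write(kryo: Kryo, output: Output, record: T) {
    val bytes = new java.io.ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(bytes, null)
    new SpecificDatumWriter[T](record.getSchema).write(record, encoder)
    encoder.flush()
    val payload = bytes.toByteArray
    output.writeInt(payload.length, true)
    output.writeBytes(payload)
  }

  override def read(kryo: Kryo, input: Input, klass: Class[T]): T = {
    val payload = input.readBytes(input.readInt(true))
    val decoder = DecoderFactory.get().binaryDecoder(payload, null)
    new SpecificDatumReader[T](klass).read(null.asInstanceOf[T], decoder)
  }
}

// MyRecord stands in for your own generated Avro class.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord], new AvroSpecificSerializer[MyRecord])
  }
}

// Then point Spark at it, e.g. via system properties (or SparkConf):
//   spark.serializer=org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator=mypackage.MyKryoRegistrator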
 
 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466
 
 
 On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell i...@ianoconnell.com wrote:
Objects being transformed need to be one of these in flight. Source data can 
just use the MapReduce input formats, so anything you can do with mapred 
works. For an Avro one you probably want one of:
https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf*

or whatever you're using at the moment to open them in an MR job could 
probably be re-purposed
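
As a concrete (if hypothetical) sketch of the read side with the stock
avro-mapred input format, assuming a generated SpecificRecord class called
MyRecord, avro-mapred on the classpath, and an example input path:

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Each input record arrives as (AvroKey[MyRecord], NullWritable); unwrap the
// key to get at the actual Avro object.
val records = sc.newAPIHadoopFile[AvroKey[MyRecord], NullWritable, AvroKeyInputFormat[MyRecord]](
    "hdfs:///path/to/records.avro")
  .map { case (key, _) => key.datum() }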
 
 
 On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 
 Hi,
  I know that sources need to either be Java serializable or use Kryo 
serialization.
  Does anyone have sample code that reads, transforms, and writes Avro files 
in Spark?
 
 Thanks,
 Ron
 


exactly once

2014-04-04 Thread Bharath Bhushan
Does Spark in general assure exactly-once semantics? What happens to 
those guarantees in the presence of updateStateByKey operations -- are 
they also assured to be exactly-once?


Thanks
manku.timma at outlook dot com


Re: Largest Spark Cluster

2014-04-04 Thread Patrick Wendell
Hey Parviz,

There was a similar thread a while ago... I think that many companies like
to be discreet about the size of large clusters. But of course it would be
great if people wanted to share openly :)

For my part - I can say that Spark has been benchmarked on
hundreds-of-nodes clusters before and on jobs that crunch hundreds of
terabytes (uncompressed) of data.

- Patrick


On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote:

 Spark community,


 What's the size of the largest Spark cluster ever deployed? I've heard
 Yahoo is running Spark on several hundred nodes but don't know the actual
 number.

 can someone share?

 Thanks



Re: How to create a RPM package

2014-04-04 Thread Patrick Wendell
We might be able to incorporate the maven rpm plugin into our build. If
that can be done in an elegant way it would be nice to have that
distribution target for people who wanted to try this with arbitrary Spark
versions...

Personally I have no familiarity with that plug-in, so I'm curious whether
anyone in the community has feedback from trying it.

- Patrick


On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.com wrote:

   Hi Christophe,

  Thanks for your reply and the spec file. I have solved my issue for now.
 I didn't want to rely on building Spark using the spec file (%build section),
 as I don't want to be maintaining the list of files that need to be
 packaged. I ended up adding maven build support to make-distribution.sh.
 This script produces a tarball which I can then use to create an RPM
 package.

   Thanks,
 Rahul Singhal

   From: Christophe Préaud christophe.pre...@kelkoo.com
 Reply-To: user@spark.apache.org user@spark.apache.org
 Date: Friday 4 April 2014 7:55 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Re: How to create a RPM package

   Hi Rahul,

 Spark will be available in Fedora 21 (see:
 https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently
 scheduled for 2014-10-14, but they have already produced spec files and
 source RPMs.
 If you are stuck with EL6 like me, you can have a look at the attached
 spec file, which you can probably adapt to your needs.

 Christophe.

 On 04/04/2014 09:10, Rahul Singhal wrote:

  Hello Community,

  This is my first mail to the list and I have a small question. The maven
 build page
 (http://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages)
 mentions a way to create a Debian package, but I was wondering if there is a
 simple way (preferably through maven) to create an RPM package. Is there a
 script (probably the one used for Spark releases) that I can get my hands on?
 Or should I write one on my own?

  P.S. I don't want to use the alien tool to convert a Debian package to an
 RPM.

   Thanks,
  Rahul Singhal



 --
 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de EURO 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 This message and its attachments are confidential and intended exclusively
 for their addressees. If you are not the intended recipient of this message,
 please delete it and notify the sender.



Re: Spark on other parallel filesystems

2014-04-04 Thread Jeremy Freeman
We run Spark (in Standalone mode) on top of a network-mounted file system 
(NFS), rather than HDFS, and find it to work great. It required no modification 
or special configuration to set this up; as Matei says, we just point Spark to 
data using the file location.

-- Jeremy

On Apr 4, 2014, at 8:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 As long as the filesystem is mounted at the same path on every node, you 
 should be able to just run Spark and use a file:// URL for your files.
 
 The only downside with running it this way is that Lustre won’t expose data 
 locality info to Spark, the way HDFS does. That may not matter if it’s a 
 network-mounted file system though.
 
 Matei
 
 On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote:
 
 All
 
 Are there any drawbacks or technical challenges (or any information, really) 
 related to using Spark directly on a global parallel filesystem  like 
 Lustre/GPFS? 
 
 Any idea of what would be involved in doing a minimal proof of concept? Is 
 it just possible to run Spark unmodified (without the HDFS substrate) for a 
 start, or will that not work at all? I do know that it’s possible to 
 implement Tachyon on Lustre and get the HDFS interface – just looking at 
 other options.
 
 Venkat
 



Re: Spark on other parallel filesystems

2014-04-04 Thread Anand Avati
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 As long as the filesystem is mounted at the same path on every node, you
 should be able to just run Spark and use a file:// URL for your files.

 The only downside with running it this way is that Lustre won't expose
 data locality info to Spark, the way HDFS does. That may not matter if it's
 a network-mounted file system though.


Is the locality querying mechanism specific to HDFS mode, or is it possible
to implement plugins in Spark to query location in other ways on other
filesystems? I ask because glusterfs can expose the data location of a file
through virtual extended attributes, and I would be interested in making
Spark exploit that locality when the file location is specified as
glusterfs:// (or querying the xattr blindly for file://). How much of a
difference does data locality make for Spark use cases anyway (since most
of the computation happens in memory)? Any sort of numbers?

Thanks!
Avati
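
Spark's HadoopRDD takes its preferred locations from the Hadoop InputSplits,
which FileInputFormat fills in from FileSystem.getFileBlockLocations(), so one
route would be a Hadoop FileSystem implementation for glusterfs:// that answers
that call from the xattrs. A very rough sketch of the idea only; the
GlusterFileSystem class and the xattr-reading helper below are placeholders,
not actual glusterfs-hadoop code:

import org.apache.hadoop.fs.{BlockLocation, FileStatus, FileSystem}

abstract class GlusterFileSystem extends FileSystem {

  // Placeholder: would read the virtual extended attribute that glusterfs
  // exposes for a file and return the hostnames holding its data.
  def hostsFor(status: FileStatus): Array[String]

  override def getFileBlockLocations(status: FileStatus, start: Long, len: Long): Array[BlockLocation] = {
    val hosts = hostsFor(status)
    // Report the requested range as a single "block" on those hosts; Spark's
    // HadoopRDD turns the hosts into preferred locations for the task.
    Array(new BlockLocation(hosts.map(h => h + ":0"), hosts, start, len))
  }
}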




Matei

 On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com
 wrote:

  All

  Are there any drawbacks or technical challenges (or any information,
 really) related to using Spark directly on a global parallel filesystem
  like Lustre/GPFS?

  Any idea of what would be involved in doing a minimal proof of concept?
 Is it just possible to run Spark unmodified (without the HDFS substrate)
 for a start, or will that not work at all? I do know that it's possible to
 implement Tachyon on Lustre and get the HDFS interface - just looking at
 other options.

  Venkat