Re: module not found: org.eclipse.paho#mqtt-client;0.4.0
This ultimately means a problem with SSL in the version of Java you are using to run SBT. If you look around the internet, you'll see a bunch of discussion, most of which seems to boil down to reinstall, or update, Java. -- Sean Owen | Director, Data Science | London On Fri, Apr 4, 2014 at 12:21 PM, Dear all alicksc...@163.com wrote: Hello all, I am new to Spark/Scala. Yesterday my Spark installation failed, with a message like the one below. Who can help me: why can't the mqtt-client-0.4.0.pom be found? What should I do? Thanks a lot! command: sbt/sbt assembly [info] Updating {file:/Users/alick/spark/spark-0.9.0-incubating/}external-mqtt... [info] Resolving org.eclipse.paho#mqtt-client;0.4.0 ... [error] Server access Error: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty url=https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [error] Server access Error: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty url=https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [error] Server access Error: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty url=https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [warn] module not found: org.eclipse.paho#mqtt-client;0.4.0 [warn] local: tried [warn] /Users/alick/.ivy2/local/org.eclipse.paho/mqtt-client/0.4.0/ivys/ivy.xml [warn] Local Maven Repo: tried [warn] sonatype-snapshots: tried [warn] https://oss.sonatype.org/content/repositories/snapshots/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [warn] sonatype-staging: tried [warn] https://oss.sonatype.org/service/local/staging/deploy/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [warn] Eclipse Repo: tried [warn] https://repo.eclipse.org/content/repositories/paho-releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom [warn] public: tried [warn] http://repo1.maven.org/maven2/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom
Explain Add Input
Hi all, Could anyone explain the lines below to me? computer1 - worker computer8 - driver (master) 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer1.ant-net:60820 (size: 1262.5 KB, free: 540.3 MB) 14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(1292780) called with curMem=49555672, maxMem=825439027 14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314800 stored as bytes to memory (size 1262.5 KB, free 738.7 MB) 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314800 in memory on computer8.ant-net:49743 (size: 1262.5 KB, free: 738.7 MB) Why does Spark add the same input on computer8, which is the driver (master)? Thanks guys -- Informativa sulla Privacy: http://www.unibs.it/node/8155
RAM high consume
Hi all, I am doing some tests using JavaNetworkWordCount and I have some questions about machine performance; each test runs for approximately 2 minutes. Why does the free RAM decrease so much? I have run tests with 2 and 3 machines and got the same behavior. What should I do to get better performance in this case? # Start of test - computer1 total used free shared buffers cached Mem: 3945 711 3233 0 3 430 -/+ buffers/cache: 276 3668 Swap: 0 0 0 14:42:50 up 73 days, 3:32, 2 users, load average: 0.00, 0.06, 0.21 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314400 in memory on computer1.ant-net:60820 (size: 826.1 KB, free: 542.9 MB) 14/04/04 14:24:56 INFO MemoryStore: ensureFreeSpace(845956) called with curMem=47278100, maxMem=825439027 14/04/04 14:24:56 INFO MemoryStore: Block input-0-1396614314400 stored as bytes to memory (size 826.1 KB, free 741.3 MB) 14/04/04 14:24:56 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614314400 in memory on computer8.ant-net:49743 (size: 826.1 KB, free: 741.3 MB) 14/04/04 14:24:56 INFO BlockManagerMaster: Updated info of block input-0-1396614314400 14/04/04 14:24:56 INFO TaskSetManager: Finished TID 272 in 84 ms on computer1.ant-net (progress: 0/1) 14/04/04 14:24:56 INFO TaskSchedulerImpl: Remove TaskSet 43.0 from pool 14/04/04 14:24:56 INFO DAGScheduler: Completed ResultTask(43, 0) 14/04/04 14:24:56 INFO DAGScheduler: Stage 43 (take at DStream.scala:594) finished in 0.088 s 14/04/04 14:24:56 INFO SparkContext: Job finished: take at DStream.scala:594, took 1.872875734 s --- Time: 1396614289000 ms --- (Santiago,1) (liveliness,1) (Sun,1) (reapers,1) (offer,3) (BARBER,3) (shrewdness,1) (truism,1) (hits,1) (merchant,1) # End of test - computer1 total used free shared buffers cached Mem: 3945 2209 1735 0 5 773 -/+ buffers/cache: 1430 2514 Swap: 0 0 0 14:46:05 up 73 days, 3:35, 2 users, load average: 2.69, 1.07, 0.55 14/04/04 14:26:57 INFO TaskSetManager: Starting task 183.0:0 as TID 696 on executor 0: computer1.ant-net (PROCESS_LOCAL) 14/04/04 14:26:57 INFO TaskSetManager: Serialized task 183.0:0 as 1981 bytes in 0 ms 14/04/04 14:26:57 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 81 to sp...@computer1.ant-net:44817 14/04/04 14:26:57 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 81 is 212 bytes 14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614336600 on disk on computer1.ant-net:60820 (size: 1441.7 KB) 14/04/04 14:26:57 INFO BlockManagerMasterActor$BlockManagerInfo: Added input-0-1396614435200 in memory on computer1.ant-net:60820 (size: 1295.7 KB, free: 589.3 KB) 14/04/04 14:26:57 INFO TaskSetManager: Finished TID 696 in 56 ms on computer1.ant-net (progress: 0/1) 14/04/04 14:26:57 INFO TaskSchedulerImpl: Remove TaskSet 183.0 from pool 14/04/04 14:26:57 INFO DAGScheduler: Completed ResultTask(183, 0) 14/04/04 14:26:57 INFO DAGScheduler: Stage 183 (take at DStream.scala:594) finished in 0.057 s 14/04/04 14:26:57 INFO SparkContext: Job finished: take at DStream.scala:594, took 1.575268894 s --- Time: 1396614359000 ms --- (hapless,9) (reapers,8) (amazed,113) (feebleness,7) (offer,148) (rabble,27) (exchanging,7) (merchant,20) (incentives,2) (quarrel,48) ... Thanks Guys -- Informativa sulla Privacy: http://www.unibs.it/node/8155
how to save RDD partitions in different folders?
Hi all, Say I have an input file which I would like to partition k ways using HashPartitioner. Calling rdd.saveAsTextFile(hdfs://); will save k files as part-0 ... part-k. Is there a way to save each partition in a specific folder? i.e. src part0/part-0 part1/part-1 ... partk/part-k thanks Dimitri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark 1.0.0 release plan
Do we have a list of things we really want to get in for 1.X? Perhaps move any JIRAs out to a 1.1 release if we aren't targeting them for 1.0. It might be nice to send out reminders when these dates are approaching. Tom On Thursday, April 3, 2014 11:19 PM, Bhaskar Dutta bhas...@gmail.com wrote: Thanks a lot guys! On Fri, Apr 4, 2014 at 5:34 AM, Patrick Wendell pwend...@gmail.com wrote: Btw - after that initial thread I proposed a slightly more detailed set of dates: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - Patrick On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying configuration). Bug fixes and things like adding Python / Java APIs for new components will also still be considered. Matei On Apr 3, 2014, at 10:30 AM, Bhaskar Dutta bhas...@gmail.com wrote: Hi, Is there any change in the release plan for Spark 1.0.0-rc1 release date from what is listed in the "Proposal for Spark Release Strategy" thread? == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 Thanks, Bhaskar
Re: How to create a RPM package
Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled on 2014-10-14 but they already have produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file, which you can probably adapt to your need. Christophe. On 04/04/2014 09:10, Rahul Singhal wrote: Hello Community, This is my first mail to the list and I have a small question. The maven build pagehttp://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages mentions a way to create a debian package but I was wondering if there is a simple way (preferably through maven) to create a RPM package. Is there a script (which is probably used for spark releases) that I can get my hands on? Or should I write one on my own? P.S. I don't want to use the alien software to convert a debian package to a RPM. Thanks, Rahul Singhal Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur. Name: spark Version: 0.9.0 # Build time settings %global _full_version %{version}-incubating %global _final_name %{name}-%{_full_version} %global _spark_hadoop_version 2.2.0 %global _spark_dir /opt Release: 2 Summary: Lightning-fast cluster computing Group:Development/Libraries License: ASL 2.0 URL: http://spark.apache.org/ Source0: http://www.eu.apache.org/dist/incubator/spark/%{_final_name}/%{_final_name}.tgz BuildRequires: git Requires: /bin/bash Requires: /bin/sh Requires: /usr/bin/env %description Apache Spark is a fast and general engine for large-scale data processing. %prep %setup -q -n %{_final_name} %build SPARK_HADOOP_VERSION=%{_spark_hadoop_version} SPARK_YARN=true ./sbt/sbt assembly find bin -type f -name '*.cmd' -exec rm -f {} \; %install mkdir -p ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/{conf,jars} echo Spark %{_full_version} built for Hadoop %{_spark_hadoop_version} ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/RELEASE cp assembly/target/scala*/spark-assembly-%{_full_version}-hadoop%{_spark_hadoop_version}.jar ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/jars/spark-assembly-hadoop.jar cp conf/*.template ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name}/conf cp -r bin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name} cp -r python ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name} cp -r sbin ${RPM_BUILD_ROOT}%{_spark_dir}/%{name}/%{_final_name} %files %defattr(-,root,root,-) %{_spark_dir}/%{name} %changelog * Mon Mar 31 2014 Christophe Préaud christophe.pre...@kelkoo.com 0.9.0-2 - Use description and Summary from Fedora RPM * Wed Mar 26 2014 Christophe Préaud christophe.pre...@kelkoo.com 0.9.0-1 - first version with changelog :-)
Driver increase memory utilization
Hi Guys, Could anyone help me understand this driver behavior when I start the JavaNetworkWordCount? computer8 16:24:07 up 121 days, 22:21, 12 users, load average: 0.66, 1.27, 1.55 total used free shared buffers cached Mem: 5897 4341 1555 0 227 2798 -/+ buffers/cache: 1315 4581 Swap: 0 0 0 in 2 minutes computer8 16:23:08 up 121 days, 22:20, 12 users, load average: 0.80, 1.43, 1.62 total used free shared buffers cached Mem: 5897 5866 30 0 230 3255 -/+ buffers/cache: 2380 3516 Swap: 0 0 0 Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155
Hadoop 2.X Spark Client Jar 0.9.0 problem
Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps already being addressed, but I am having a devil of a time with a spark 0.9.0 client jar for hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz or direct file downloadhttp://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 2.3.0. I have tried repeatedly to build the source tree with the correct options per the documentation but always seemingly ending up with hadoop 1.0.4. As far as I can tell the reason that the jar available on the web site doesn't have the correct hadoop client in it, is because the build itself is having that problem. I am about to try to troubleshoot the build but wanted to see if anyone out there has encountered the same problem and/or if I am just doing something dumb (!) Anyone else using hadoop 2.X? How do you get the right client jar if so? cheers, Erik -- Erik James Freed CoDecision Software 510.859.3360 erikjfr...@codecision.com 1480 Olympus Avenue Berkeley, CA 94708 179 Maria Lane Orcas, WA 98245
Re: Hadoop 2.X Spark Client Jar 0.9.0 problem
Hi Erik, I am working with TOT branch-0.9 ( 0.9.1) and the following works for me for maven build: export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package And from http://spark.apache.org/docs/latest/running-on-yarn.html, for sbt build, you could try: SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly Thanks, Rahul Singhal From: Erik Freed erikjfr...@codecision.commailto:erikjfr...@codecision.com Reply-To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Date: Friday 4 April 2014 7:58 PM To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps already being addressed, but I am having a devil of a time with a spark 0.9.0 client jar for hadoop 2.X. If I go to the site and download: * Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz or direct file downloadhttp://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 2.3.0. I have tried repeatedly to build the source tree with the correct options per the documentation but always seemingly ending up with hadoop 1.0.4. As far as I can tell the reason that the jar available on the web site doesn't have the correct hadoop client in it, is because the build itself is having that problem. I am about to try to troubleshoot the build but wanted to see if anyone out there has encountered the same problem and/or if I am just doing something dumb (!) Anyone else using hadoop 2.X? How do you get the right client jar if so? cheers, Erik -- Erik James Freed CoDecision Software 510.859.3360 erikjfr...@codecision.commailto:erikjfr...@codecision.com 1480 Olympus Avenue Berkeley, CA 94708 179 Maria Lane Orcas, WA 98245
Re: Hadoop 2.X Spark Client Jar 0.9.0 problem
I believe you got to set following SPARK_HADOOP_VERSION=2.2.0 (or whatever your version is) SPARK_YARN=true then type sbt/sbt assembly If you are using Maven to compile mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package Hope this helps -A On Fri, Apr 4, 2014 at 7:28 AM, Erik Freed erikjfr...@codecision.comwrote: Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps already being addressed, but I am having a devil of a time with a spark 0.9.0 client jar for hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz or direct file downloadhttp://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 2.3.0. I have tried repeatedly to build the source tree with the correct options per the documentation but always seemingly ending up with hadoop 1.0.4. As far as I can tell the reason that the jar available on the web site doesn't have the correct hadoop client in it, is because the build itself is having that problem. I am about to try to troubleshoot the build but wanted to see if anyone out there has encountered the same problem and/or if I am just doing something dumb (!) Anyone else using hadoop 2.X? How do you get the right client jar if so? cheers, Erik -- Erik James Freed CoDecision Software 510.859.3360 erikjfr...@codecision.com 1480 Olympus Avenue Berkeley, CA 94708 179 Maria Lane Orcas, WA 98245
Re: how to save RDD partitions in different folders?
Hi Evan, Could you please provide a code snippet? It is not clear to me; in Hadoop you would use the addNamedOutput method, and I'm stuck on how to do the same from Spark. Thank you, Konstantin Kudryavtsev On Fri, Apr 4, 2014 at 5:27 PM, Evan Sparks evan.spa...@gmail.com wrote: Have a look at MultipleOutputs in the Hadoop API. Spark can read and write to arbitrary Hadoop formats. On Apr 4, 2014, at 6:01 AM, dmpour23 dmpou...@gmail.com wrote: Hi all, Say I have an input file which I would like to partition k ways using HashPartitioner. Calling rdd.saveAsTextFile(hdfs://); will save k files as part-0 ... part-k. Is there a way to save each partition in a specific folder? i.e. src part0/part-0 part1/part-1 ... partk/part-k thanks Dimitri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-save-RDD-partitions-in-different-folders-tp3754.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
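Since no snippet was posted in this thread, here is a minimal, untested Scala sketch of the approach Evan points at, using Hadoop's older MultipleTextOutputFormat (rather than MultipleOutputs) through saveAsHadoopFile. The input path, output path, number of partitions and keying scheme are all made up for illustration.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Write each key's records under its own sub-folder, e.g. part0/part-00000.
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        "part" + key.toString + "/" + name
      // Drop the key so only the value ends up in the output file.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    val sc = new SparkContext("local[2]", "partitioned-output")
    val k = 3
    val keyed = sc.textFile("hdfs:///input/src")           // hypothetical input path
      .map(line => (math.abs(line.hashCode) % k, line))    // hypothetical keying scheme
      .partitionBy(new HashPartitioner(k))

    keyed.saveAsHadoopFile("hdfs:///output", classOf[Any], classOf[Any], classOf[KeyBasedOutput])

The generateFileNameForKeyValue override chooses the sub-folder per key, so records partitioned by the HashPartitioner end up under part0/, part1/, ... as asked in the original question.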
Re: Job initialization performance of Spark standalone mode vs YARN
Hi, Can you explain a little more what's going on? Which one submits a job to the yarn cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our yarn cluster and it seems to work that way. Shouldn't Client.scala be running within the AppMaster instance in this run mode? How exactly does yarn-standalone work? Thanks, Ron Sent from my iPhone On Apr 3, 2014, at 11:19 AM, Kevin Markey kevin.mar...@oracle.com wrote: We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. Yarn but client vs. standalone and where the client is located on the network relative to the cluster. The client options that locate the client/master remote from the cluster, while useful for interactive queries, suffer from considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster. The standalone options locate the master/client on the cluster. In yarn-standalone, the master is a thread contained by the Yarn Resource Manager. Lots less traffic, as the master is co-located with the worker nodes on the cluster and its scheduling/data communication has less latency. In my comparisons between yarn-client and yarn-standalone (so as not to conflate yarn vs Spark), yarn-client computation time is at least double yarn-standalone! At least for a job with lots of stages and lots of client/worker communication, although rather few collect actions, so it's mainly scheduling that's relevant here. I'll be posting more information as I have it available. Kevin On 03/03/2014 03:48 PM, Sandy Ryza wrote: Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat? On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing at capacity in under a second :) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Hadoop 2.X Spark Client Jar 0.9.0 problem
Thanks all for the update - I have actually built using those options every which way I can think of so perhaps this is something I am doing about how I upload the jar to our artifactory repo server. Anyone have a working pom file for the publish of a spark 0.9 hadoop 2.X publish to a maven repo server? cheers, Erik On Fri, Apr 4, 2014 at 7:54 AM, Rahul Singhal rahul.sing...@guavus.comwrote: Hi Erik, I am working with TOT branch-0.9 ( 0.9.1) and the following works for me for maven build: export MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m mvn -Pyarn -Dhadoop.version=2.3.0 -Dyarn.version=2.3.0 -DskipTests clean package And from http://spark.apache.org/docs/latest/running-on-yarn.html, for sbt build, you could try: SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly Thanks, Rahul Singhal From: Erik Freed erikjfr...@codecision.com Reply-To: user@spark.apache.org user@spark.apache.org Date: Friday 4 April 2014 7:58 PM To: user@spark.apache.org user@spark.apache.org Subject: Hadoop 2.X Spark Client Jar 0.9.0 problem Hi All, I am not sure if this is a 0.9.0 problem to be fixed in 0.9.1 so perhaps already being addressed, but I am having a devil of a time with a spark 0.9.0 client jar for hadoop 2.X. If I go to the site and download: - Download binaries for Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-hadoop2.tgz or direct file downloadhttp://d3kbcqa49mib13.cloudfront.net/spark-0.9.0-incubating-bin-hadoop2.tgz I get a jar with what appears to be hadoop 1.0.4 that fails when using hadoop 2.3.0. I have tried repeatedly to build the source tree with the correct options per the documentation but always seemingly ending up with hadoop 1.0.4. As far as I can tell the reason that the jar available on the web site doesn't have the correct hadoop client in it, is because the build itself is having that problem. I am about to try to troubleshoot the build but wanted to see if anyone out there has encountered the same problem and/or if I am just doing something dumb (!) Anyone else using hadoop 2.X? How do you get the right client jar if so? cheers, Erik -- Erik James Freed CoDecision Software 510.859.3360 erikjfr...@codecision.com 1480 Olympus Avenue Berkeley, CA 94708 179 Maria Lane Orcas, WA 98245 -- Erik James Freed CoDecision Software 510.859.3360 erikjfr...@codecision.com 1480 Olympus Avenue Berkeley, CA 94708 179 Maria Lane Orcas, WA 98245
Parallelism level
Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20 this parallelism level, is it correct? The machine's processor is a dual core. Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155
Re: Regarding Sparkcontext object
On Wed, Apr 2, 2014 at 7:11 PM, yh18190 yh18...@gmail.com wrote: Is it always necessary for the SparkContext object to be created in the main method of a class? Can we create the sc object in another class and use it by passing the object around through functions? The Spark context can be initialized wherever you like and passed around just as any other object. Just don't try to create multiple contexts against "local" (without stopping the previous one first), or you may get ArrayStoreExceptions (I learned that one the hard way). -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
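A small sketch (not from the thread) of what Daniel describes: create the context once and hand it to other classes as a constructor argument. The class name, app name and input path are invented.

    import org.apache.spark.SparkContext

    // A class that receives an already-created context instead of building its own.
    class WordCounter(sc: SparkContext) {
      def count(path: String): Long =
        sc.textFile(path).flatMap(_.split("\\s+")).count()
    }

    object Main {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "shared-context")
        val counter = new WordCounter(sc)   // pass the same context around
        println(counter.count("README.md"))
        sc.stop()                           // stop it before ever creating another local context
      }
    }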
Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1
Hi Wisely, Could you please post your pom.xml here. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-reading-HDFS-file-using-spark-0-9-0-hadoop-2-2-0-incompatible-protobuf-2-5-and-2-4-1-tp2158p3770.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RAM Increase
Hi Guys, Could anyone explain this behavior to me? After 2 minutes of tests: computer1 - worker computer10 - worker computer8 - driver (master) computer1 18:24:31 up 73 days, 7:14, 1 user, load average: 3.93, 2.45, 1.14 total used free shared buffers cached Mem: 3945 3925 19 0 18 1368 -/+ buffers/cache: 2539 1405 Swap: 0 0 0 computer10 18:22:38 up 44 days, 21:26, 2 users, load average: 3.05, 2.20, 1.03 total used free shared buffers cached Mem: 5897 5292 604 0 46 2707 -/+ buffers/cache: 2538 3358 Swap: 0 0 0 computer8 18:24:13 up 122 days, 22 min, 13 users, load average: 1.10, 0.93, 0.82 total used free shared buffers cached Mem: 5897 5841 55 0 113 2747 -/+ buffers/cache: 2980 2916 Swap: 0 0 0 Thanks Guys -- Informativa sulla Privacy: http://www.unibs.it/node/8155
Re: Parallelism level
If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20 this parallelism level, is it correct? The machine's processor is a dual core. Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155
Re: java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer...
Hi Francis, This might be a long shot, but do you happen to have built Spark on an encrypted home dir? (I was running into the same error when I was doing that. Rebuilding on an unencrypted disk fixed the issue. This is a known issue / limitation with ecryptfs. It's weird that the build doesn't fail, but you do get warnings about the long file names.) On Wed, Apr 2, 2014 at 3:26 AM, Francis.Hu francis...@reachjunction.com wrote: I am stuck on a NoClassDefFoundError. Any help would be appreciated. I downloaded the Spark 0.9.0 source and then ran this command to build it: SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly java.lang.NoClassDefFoundError: scala/tools/nsc/transform/UnCurry$UnCurryTransformer$$anonfun$14$$anonfun$apply$5$$anonfun$scala$tools$nsc$transform$UnCurry$UnCurryTransformer$$anonfun$$anonfun$$transformInConstructor$1$1 -- Marcelo
How are exceptions in map functions handled in Spark?
I'm trying to get a clear idea about how exceptions are handled in Spark. Is there somewhere I can read about this? I'm on Spark 0.7. For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored, but the exception is logged. However, right now we're seeing the task just retried over and over again in an infinite loop because there's a value that always generates an exception. John
Re: How are exceptions in map functions handled in Spark?
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which should always work. I’d recommend trying a newer Spark version, 0.8 should be easy to upgrade to from 0.7. Matei On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote: I'm trying to get a clear idea about how exceptions are handled in Spark? Is there somewhere where I can read about this? I'm on spark .7 For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored but the exception is logged. However, right now we're seeing the task just re-tried over and over again in an infinite loop because there's a value that always generates an exception. John
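Independently of upgrading, a common way to stop one known-bad value from retrying a task forever is to catch the exception inside the function itself. A hedged Scala sketch, assuming lines is an RDD[String] and that a NumberFormatException is the failure in question:

    // Bad records become None and are dropped instead of failing the task;
    // the message goes to the executor's stderr, visible in the worker logs.
    val parsed = lines.flatMap { line =>
      try {
        Some(line.trim.toInt)
      } catch {
        case e: NumberFormatException =>
          System.err.println("Skipping bad record '" + line + "': " + e)
          None
      }
    }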
Re: Example of creating expressions for SchemaRDD methods
In such a construct, each operator builds on the previous one, including any materialized results etc. If I use a SQL query for each of them, I suspect the later SQL queries will not leverage the earlier ones by any means - hence these will be inefficient compared to the first approach. Let me know if this is not correct. This is not correct. When you run a SQL statement and register it as a table, it is the logical plan for this query that is used when this virtual table is referenced in later queries, not the results. SQL queries are lazy, just like RDDs and DSL queries. This is illustrated below. scala> sql("SELECT * FROM selectQuery") res3: org.apache.spark.sql.SchemaRDD = SchemaRDD[12] at RDD at SchemaRDD.scala:93 == Query Plan == HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None scala> sql("SELECT * FROM src").registerAsTable("selectQuery") scala> sql("SELECT key FROM selectQuery") res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[24] at RDD at SchemaRDD.scala:93 == Query Plan == HiveTableScan [key#8], (MetastoreRelation default, src, None), None Even though the second query is running over the results of the first query (which requested all columns using *), the optimizer is still able to come up with an efficient plan that avoids reading the value column from the table, as can be seen by the arguments of the HiveTableScan. Note that if you call sqlContext.cacheTable("selectQuery") then you are correct. The results will be materialized in an in-memory columnar format, and subsequent queries will be run over these materialized results. The reason for building expressions is that the use case needs these to be created on the fly based on some case class at runtime. I.e., I can't type these in the REPL. The Scala code will define some case class A (a: ... , b: ..., c: ... ) where the class name, member names and types will be known beforehand, and the RDD will be defined on this. Then, based on user action, the above pipeline needs to be constructed on the fly. Thus the expressions have to be constructed on the fly from class members and other predicates etc., most probably using expression constructors. Could you please share how expressions could be constructed using the APIs on Expression (and not in the REPL)? I'm not sure I completely understand the use case here, but you should be able to construct symbols and use the DSL to create expressions at runtime, just like in the REPL. val attrName: String = "name" val addExpression: Expression = Symbol(attrName) + Symbol(attrName) There is currently no public API for constructing expressions manually other than SQL or the DSL. While you could dig into org.apache.spark.sql.catalyst.expressions._, these APIs are considered internal, and *will not be stable in between versions*. Michael
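To make the runtime-construction point concrete, here is a rough, untested sketch against the then-new SchemaRDD DSL. The case class, attribute name and data are invented, and since these APIs were internal and unstable at the time, the exact implicits may differ between versions.

    import org.apache.spark.sql.SQLContext

    case class A(a: Int, b: String, c: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext._    // brings in createSchemaRDD and the DSL implicits

    val records = sc.parallelize(Seq(A(1, "x", 1.5), A(2, "y", 2.5)))

    // The attribute name is only known at runtime, e.g. from user input.
    val attrName = "a"
    val result = records.where(Symbol(attrName) > 1).select(Symbol(attrName))
    result.collect().foreach(println)

The only essential point is that Symbol(attrName) can be built from a plain String at runtime, exactly as in Michael's addExpression example.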
Re: Parallelism level
What do you advise, Nicholas? On 4/4/14, 19:05, Nicholas Chammas wrote: If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it mailto:e.costaalf...@unibs.it wrote: Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20 this parallelism level, is it correct? The machine's processor is a dual core. Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155 -- Informativa sulla Privacy: http://www.unibs.it/node/8155
Re: How are exceptions in map functions handled in Spark?
Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote: Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which should always work. I'd recommend trying a newer Spark version, 0.8 should be easy to upgrade to from 0.7. Matei On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote: I'm trying to get a clear idea about how exceptions are handled in Spark? Is there somewhere where I can read about this? I'm on spark .7 For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored but the exception is logged. However, right now we're seeing the task just re-tried over and over again in an infinite loop because there's a value that always generates an exception. John
Re: How are exceptions in map functions handled in Spark?
Btw, thank you for your help. On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.comwrote: Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote: Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which should always work. I'd recommend trying a newer Spark version, 0.8 should be easy to upgrade to from 0.7. Matei On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote: I'm trying to get a clear idea about how exceptions are handled in Spark? Is there somewhere where I can read about this? I'm on spark .7 For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored but the exception is logged. However, right now we're seeing the task just re-tried over and over again in an infinite loop because there's a value that always generates an exception. John
Re: Example of creating expressions for SchemaRDD methods
Minor typo in the example. The first SELECT statement should actually be: sql(SELECT * FROM src) Where `src` is a HiveTable with schema (key INT value STRING). On Fri, Apr 4, 2014 at 11:35 AM, Michael Armbrust mich...@databricks.comwrote: In such construct, each operator builds on the previous one, including any materialized results etc. If I use a SQL for each of them, I suspect the later SQLs will not leverage the earlier SQLs by any means - hence these will be inefficient to first approach. Let me know if this is not correct. This is not correct. When you run a SQL statement and register it as a table, it is the logical plan for this query is used when this virtual table is referenced in later queries, not the results. SQL queries are lazy, just like RDDs and DSL queries. This is illustrated below. scala sql(SELECT * FROM selectQuery) res3: org.apache.spark.sql.SchemaRDD = SchemaRDD[12] at RDD at SchemaRDD.scala:93 == Query Plan == HiveTableScan [key#4,value#5], (MetastoreRelation default, src, None), None scala sql(SELECT * FROM src).registerAsTable(selectQuery) scala sql(SELECT key FROM selectQuery) res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[24] at RDD at SchemaRDD.scala:93 == Query Plan == HiveTableScan [key#8], (MetastoreRelation default, src, None), None Even though the second query is running over the results of the first query (which requested all columns using *), the optimizer is still able to come up with an efficient plan that avoids reading value from the table, as can be seen by the arguments of the HiveTableScan. Note that if you call sqlContext.cacheTable(selectQuery) then you are correct. The results will be materialized in an in-memory columnar format, and subsequent queries will be run over these materialized results. The reason for building expressions is that the use case needs these to be created on the fly based on some case class at runtime. I.e., I can't type these in REPL. The scala code will define some case class A (a: ... , b: ..., c: ... ) where class name, member names and types will be known before hand and the RDD will be defined on this. Then based on user action, above pipeline needs to be constructed on fly. Thus the expressions has to be constructed on fly from class members and other predicates etc., most probably using expression constructors. Could you please share how expressions could be constructed using the APIs on expression (and not on REPL) ? I'm not sure I completely understand the use case here, but you should be able to construct symbols and use the DSL to create expressions at runtime, just like in the REPL. val attrName: String = name val addExpression: Expression = Symbol(attrName) + Symbol(attrName) There is currently no public API for constructing expressions manually other than SQL or the DSL. While you could dig into org.apache.spark.sql.catalyst.expressions._, these APIs are considered internal, and *will not be stable in between versions*. Michael
Largest Spark Cluster
Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes but don't know the actual number. can someone share? Thanks
Re: Parallelism level
If you want more parallelism, you need more cores. So, use a machine with more cores, or use a cluster of machines. spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the easiest way to do this. If you're stuck on a single machine with 2 cores, then set your default parallelism to 2. Setting it to a higher number won't do anything helpful. On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: What do you advise, Nicholas? On 4/4/14, 19:05, Nicholas Chammas wrote: If you're running on one machine with 2 cores, I believe all you can get out of it are 2 concurrent tasks at any one time. So setting your default parallelism to 20 won't help. On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it wrote: Hi all, I have put this line in my spark-env.sh: -Dspark.default.parallelism=20 this parallelism level, is it correct? The machine's processor is a dual core. Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155 Informativa sulla Privacy: http://www.unibs.it/node/8155
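For reference, a small sketch of the two settings discussed in this thread, as they were typically set programmatically in the 0.9 era; the values are what one would use on a 2-core machine, and the input path is illustrative.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Ask for 2 local worker threads and keep the default parallelism at 2 to match.
    System.setProperty("spark.default.parallelism", "2")
    val sc = new SparkContext("local[2]", "parallelism-example")

    val counts = sc.textFile("README.md")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // uses spark.default.parallelism for the shuffle
    println(counts.count())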
Re: example of non-line oriented input data?
FYI, one thing we’ve added now is support for reading multiple text files from a directory as separate records: https://github.com/apache/spark/pull/327. This should remove the need for mapPartitions discussed here. Avro and SequenceFiles look like they may not make it for 1.0, but there’s a chance that Parquet support with Spark SQL will, which should let you store binary data a bit better. Matei On Mar 19, 2014, at 3:12 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Another vote on this, support for simple SequenceFiles and/or Avro would be terrific, as using plain text can be very space-inefficient, especially for numerical data. -- Jeremy On Mar 19, 2014, at 5:24 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I'd second the request for Avro support in Python first, followed by Parquet. On Wed, Mar 19, 2014 at 2:14 PM, Evgeny Shishkin itparan...@gmail.com wrote: On 19 Mar 2014, at 19:54, Diana Carroll dcarr...@cloudera.com wrote: Actually, thinking more on this question, Matei: I'd definitely say support for Avro. There's a lot of interest in this!! Agree, and parquet as default Cloudera Impala format. On Tue, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW one other thing — in your experience, Diana, which non-text InputFormats would be most useful to support in Python first? Would it be Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or something else? I think a per-file text input format that does the stuff we did here would also be good. Matei On Mar 18, 2014, at 3:27 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, This seems to work without the iter() in front if you just return treeiterator. What happened when you didn’t include that? Treeiterator should return an iterator. Anyway, this is a good example of mapPartitions. It’s one where you want to view the whole file as one object (one XML here), so you couldn’t implement this using a flatMap, but you still want to return multiple values. The MLlib example you saw needs Python 2.7 because unfortunately that is a requirement for our Python MLlib support (see http://spark.incubator.apache.org/docs/0.9.0/python-programming-guide.html#libraries). We’d like to relax this later but we’re using some newer features of NumPy and Python. The rest of PySpark works on 2.6. In terms of the size in memory, here both the string s and the XML tree constructed from it need to fit in, so you can’t work on very large individual XML files. You may be able to use a streaming XML parser instead to extract elements from the data in a streaming fashion, without every materializing the whole tree. http://docs.python.org/2/library/xml.sax.reader.html#module-xml.sax.xmlreader is one example. Matei On Mar 18, 2014, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote: Well, if anyone is still following this, I've gotten the following code working which in theory should allow me to parse whole XML files: (the problem was that I can't return the tree iterator directly. I have to call iter(). Why?) import xml.etree.ElementTree as ET # two source files, format data country name=../country.../data mydata=sc.textFile(file:/home/training/countries*.xml) def parsefile(iterator): s = '' for i in iterator: s = s + str(i) tree = ET.fromstring(s) treeiterator = tree.getiterator(country) # why to I have to convert an iterator to an iterator? 
not sure but required return iter(treeiterator) mydata.mapPartitions(lambda x: parsefile(x)).map(lambda element: element.attrib).collect() The output is what I expect: [{'name': 'Liechtenstein'}, {'name': 'Singapore'}, {'name': 'Panama'}] BUT I'm a bit concerned about the construction of the string s. How big can my file be before converting it to a string becomes problematic? On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com wrote: Thanks, Matei. In the context of this discussion, it would seem mapParitions is essential, because it's the only way I'm going to be able to process each file as a whole, in our example of a large number of small XML files which need to be parsed as a whole file because records are not required to be on a single line. The theory makes sense but I'm still utterly lost as to how to implement it. Unfortunately there's only a single example of the use of mapPartitions in any of the Python example programs, which is the log regression example, which I can't run because it requires Python 2.7 and I'm on Python 2.6. (aside: I couldn't find any statement that Python 2.6 is unsupported...is it?) I'd really really love to see a real life example of a Python use of mapPartitions. I do appreciate the very simple examples you provided, but (perhaps because of my novice status on Python) I can't figure out
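For completeness, a hedged Scala sketch of the per-file reader Matei mentions above (SparkContext.wholeTextFiles, added for 1.0 in the linked pull request), applied to the countries XML example from this thread; the path pattern is taken from Diana's code and the rest is illustrative.

    import scala.xml.XML

    // Each file becomes a single (path, content) record, so the XML can be parsed whole.
    val countryNames = sc.wholeTextFiles("file:/home/training/countries*.xml")
      .flatMap { case (path, content) =>
        (XML.loadString(content) \\ "country").map(node => (node \ "@name").text)
      }

    countryNames.collect().foreach(println)
    // e.g. Liechtenstein, Singapore, Panama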
Re: Having spark-ec2 join new slaves to existing cluster
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped. If the cluster is stopped, just go into the AWS Console, right click a slave and choose “launch more of these” to add more. Or select multiple slaves and delete them. When you run spark-ec2 start the next time to start your cluster, it will set it up on all the machines it finds in the mycluster-slaves security group. This is pretty hacky so it would definitely be good to add this feature; feel free to open a JIRA about it. Matei On Apr 4, 2014, at 12:16 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I would like to be able to use spark-ec2 to launch new slaves and add them to an existing, running cluster. Similarly, I would also like to remove slaves from an existing cluster. Use cases include: Oh snap, I sized my cluster incorrectly. Let me add/remove some slaves. During scheduled batch processing, I want to add some new slaves, perhaps on spot instances. When that processing is done, I want to kill them. (Cruel, I know.) I gather this is not possible at the moment. spark-ec2 appears to be able to launch new slaves for an existing cluster only if the master is stopped. I also do not see any ability to remove slaves from a cluster. Is that correct? Are there plans to add such functionality to spark-ec2 in the future? Nick View this message in context: Having spark-ec2 join new slaves to existing cluster Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: How to create a RPM package
Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely building spark using the spec file (%build section) as I don't want to be maintaining the list of files that need to be packaged. I ended up adding maven build support to make-distribution.sh. This script produces a tar ball which I can then use to create a RPM package. Thanks, Rahul Singhal From: Christophe Préaud christophe.pre...@kelkoo.commailto:christophe.pre...@kelkoo.com Reply-To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Date: Friday 4 April 2014 7:55 PM To: user@spark.apache.orgmailto:user@spark.apache.org user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: How to create a RPM package Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled on 2014-10-14 but they already have produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file, which you can probably adapt to your need. Christophe. On 04/04/2014 09:10, Rahul Singhal wrote: Hello Community, This is my first mail to the list and I have a small question. The maven build pagehttp://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages mentions a way to create a debian package but I was wondering if there is a simple way (preferably through maven) to create a RPM package. Is there a script (which is probably used for spark releases) that I can get my hands on? Or should I write one on my own? P.S. I don't want to use the alien software to convert a debian package to a RPM. Thanks, Rahul Singhal Kelkoo SAS Société par Actions Simplifiée Au capital de € 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Re: reduceByKeyAndWindow Java
Hi Tathagata, You are right, this code compiles, but I am having some problems with high memory consumption; I sent an email about this today, but no response so far. Thanks On 4/4/14, 22:56, Tathagata Das wrote: I haven't really compiled the code, but it looks good to me. Why? Is there any problem you are facing? TD On Fri, Apr 4, 2014 at 8:03 AM, Eduardo Costa Alfaia e.costaalf...@unibs.it mailto:e.costaalf...@unibs.it wrote: Hi guys, I would like to know if this piece of code is right for use with a window.

    JavaPairDStream<String, Integer> wordCounts = words.map(
        new PairFunction<String, String, Integer>() {
          @Override
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        }).reduceByKeyAndWindow(
          new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) { return i1 + i2; }
          },
          new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) { return i1 - i2; }
          },
          new Duration(60 * 5 * 1000),
          new Duration(1 * 1000)
        );

Thanks -- Informativa sulla Privacy: http://www.unibs.it/node/8155 -- Informativa sulla Privacy: http://www.unibs.it/node/8155
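A Scala rendering of the same windowed count, untested; note that this invertible form of reduceByKeyAndWindow requires a checkpoint directory, which may also be relevant to the memory growth Eduardo mentions. ssc and words are assumed to already exist, and the durations and checkpoint path are illustrative.

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    ssc.checkpoint("hdfs:///spark-checkpoints")   // required for the invertible form

    val wordCounts = words.map(w => (w, 1)).reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // add counts entering the window
      (a: Int, b: Int) => a - b,   // subtract counts leaving the window
      Seconds(300),                // window length: 5 minutes
      Seconds(1))                  // slide interval: 1 second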
Re: How are exceptions in map functions handled in Spark?
Logging inside a map function shouldn't freeze things. The messages should be logged on the worker logs, since the code is executed on the executors. If you throw a SparkException, however, it'll be propagated to the driver after it has failed 4 or more times (by default). On Fri, Apr 4, 2014 at 11:57 AM, John Salvatier jsalvat...@gmail.comwrote: Btw, thank you for your help. On Fri, Apr 4, 2014 at 11:49 AM, John Salvatier jsalvat...@gmail.comwrote: Is there a way to log exceptions inside a mapping function? logError and logInfo seem to freeze things. On Fri, Apr 4, 2014 at 11:02 AM, Matei Zaharia matei.zaha...@gmail.comwrote: Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which should always work. I'd recommend trying a newer Spark version, 0.8 should be easy to upgrade to from 0.7. Matei On Apr 4, 2014, at 10:40 AM, John Salvatier jsalvat...@gmail.com wrote: I'm trying to get a clear idea about how exceptions are handled in Spark? Is there somewhere where I can read about this? I'm on spark .7 For some reason I was under the impression that such exceptions are swallowed and the value that produced them ignored but the exception is logged. However, right now we're seeing the task just re-tried over and over again in an infinite loop because there's a value that always generates an exception. John
Re: Spark output compression on HDFS
There is no compress type for snappy. Sent from my iPhone5s On 2014年4月4日, at 23:06, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Can anybody suggest how to change compression level (Record, Block) for Snappy? if it possible, of course thank you in advance Thank you, Konstantin Kudryavtsev On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Thanks all, it works fine now and I managed to compress output. However, I am still in stuck... How is it possible to set compression type for Snappy? I mean to set up record or block level of compression for output On Apr 3, 2014 1:15 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for pointing that out. On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com wrote: First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point. On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Is this a Scala-only feature? On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra m...@clearstorydata.com wrote: http://www.scala-lang.org/api/2.10.3/index.html#scala.Option The signature is 'def saveAsSequenceFile(path: String, codec: Option[Class[_ : CompressionCodec]] = None)', but you are providing a Class, not an Option[Class]. Try counts.saveAsSequenceFile(output, Some(classOf[org.apache.hadoop.io.compress.SnappyCodec])) On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev kudryavtsev.konstan...@gmail.com wrote: Hi there, I've started using Spark recently and evaluating possible use cases in our company. I'm trying to save RDD as compressed Sequence file. I'm able to save non-compressed file be calling: counts.saveAsSequenceFile(output) where counts is my RDD (IntWritable, Text). However, I didn't manage to compress output. I tried several configurations and always got exception: counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec]) console:21: error: type mismatch; found : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec]) required: Option[Class[_ : org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec]) counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec]) console:21: error: type mismatch; found : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec]) required: Option[Class[_ : org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec]) and it doesn't work even for Gzip: counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec]) console:21: error: type mismatch; found : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec]) required: Option[Class[_ : org.apache.hadoop.io.compress.CompressionCodec]] counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec]) Could you please suggest solution? 
Also, I didn't find how it is possible to specify compression parameters (i.e. the compression type for Snappy). I wonder if you could share code snippets for writing/reading an RDD with compression? Thank you in advance, Konstantin Kudryavtsev
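Summarizing how the codec arguments ended up looking in this thread (counts being the pair RDD from the original question, and the output paths being placeholders); the Option wrapper is what the earlier compile errors were about, and the saveAsTextFile overload is the one Patrick points to.

    import org.apache.hadoop.io.compress.{GzipCodec, SnappyCodec}

    // saveAsSequenceFile takes an Option[Class[_ <: CompressionCodec]]
    counts.saveAsSequenceFile("hdfs:///out/seq-snappy", Some(classOf[SnappyCodec]))

    // the saveAsTextFile overload mentioned above takes the codec class directly
    counts.saveAsTextFile("hdfs:///out/text-gzip", classOf[GzipCodec])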
Spark on other parallel filesystems
All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it’s possible to implement Tachyon on Lustre and get the HDFS interface – just looking at other options. Venkat
Re: Spark on other parallel filesystems
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a network-mounted file system though. Matei On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote: All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it’s possible to implement Tachyon on Lustre and get the HDFS interface – just looking at other options. Venkat
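A one-line illustration of Matei's point; the mount path is made up.

    // With the parallel filesystem mounted at the same path on every node,
    // a plain file:// URL works and no HDFS layer is involved.
    val events = sc.textFile("file:///mnt/lustre/datasets/events.log")
    println(events.count())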
Re: Avro serialization
Thanks will take a look... Sent from my iPad On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala Also, Matt Massie has a good blog post about this at http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell i...@ianoconnell.com wrote: Objects been transformed need to be one of these in flight. Source data can just use the mapreduce input formats, so anything you can do with mapred. doing an avro one for this you probably want one of : https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf* or just whatever your using at the moment to open them in a MR job probably could be re-purposed On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, I know that sources need to either be java serializable or use kryo serialization. Does anyone have sample code that reads, transforms and writes avro files in spark? Thanks, Ron
exactly once
Does Spark in general assure exactly-once semantics? What happens to those guarantees in the presence of updateStateByKey operations -- are they also assured to be exactly-once? Thanks manku.timma at outlook dot com
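For readers who have not used the operation being asked about, a minimal sketch of updateStateByKey keeping a running count per key (this only illustrates the operator; it does not answer the exactly-once question). It assumes an existing DStream of (word, 1) pairs named wordCounts and a checkpoint directory configured on the StreamingContext.

    // Running count per key across batches: each batch's new values are folded
    // into the previous state, which Spark Streaming checkpoints for recovery.
    val runningCounts = wordCounts.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
    }
    runningCounts.print()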
Re: Largest Spark Cluster
Hey Parviz, There was a similar thread a while ago... I think that many companies like to be discrete about the size of large clusters. But of course it would be great if people wanted to share openly :) For my part - I can say that Spark has been benchmarked on hundreds-of-nodes clusters before and on jobs that crunch hundreds of terabytes (uncompressed) of data. - Patrick On Fri, Apr 4, 2014 at 12:05 PM, Parviz Deyhim pdey...@gmail.com wrote: Spark community, What's the size of the largest Spark cluster ever deployed? I've heard Yahoo is running Spark on several hundred nodes but don't know the actual number. can someone share? Thanks
Re: How to create a RPM package
We might be able to incorporate the maven rpm plugin into our build. If that can be done in an elegant way it would be nice to have that distribution target for people who wanted to try this with arbitrary Spark versions... Personally I have no familiarity with that plug-in, so curious if anyone in the community has feedback from trying this. - Patrick On Fri, Apr 4, 2014 at 12:43 PM, Rahul Singhal rahul.sing...@guavus.comwrote: Hi Christophe, Thanks for your reply and the spec file. I have solved my issue for now. I didn't want to rely building spark using the spec file (%build section) as I don't want to be maintaining the list of files that need to be packaged. I ended up adding maven build support to make-distribution.sh. This script produces a tar ball which I can then use to create a RPM package. Thanks, Rahul Singhal From: Christophe Préaud christophe.pre...@kelkoo.com Reply-To: user@spark.apache.org user@spark.apache.org Date: Friday 4 April 2014 7:55 PM To: user@spark.apache.org user@spark.apache.org Subject: Re: How to create a RPM package Hi Rahul, Spark will be available in Fedora 21 (see: https://fedoraproject.org/wiki/SIGs/bigdata/packaging/Spark), currently scheduled on 2014-10-14 but they already have produced spec files and source RPMs. If you are stuck with EL6 like me, you can have a look at the attached spec file, which you can probably adapt to your need. Christophe. On 04/04/2014 09:10, Rahul Singhal wrote: Hello Community, This is my first mail to the list and I have a small question. The maven build pagehttp://spark.apache.org/docs/latest/building-with-maven.html#building-spark-debian-packages mentions a way to create a debian package but I was wondering if there is a simple way (preferably through maven) to create a RPM package. Is there a script (which is probably used for spark releases) that I can get my hands on? Or should I write one on my own? P.S. I don't want to use the alien software to convert a debian package to a RPM. Thanks, Rahul Singhal -- Kelkoo SAS Société par Actions Simplifiée Au capital de EURO 4.168.964,30 Siège social : 8, rue du Sentier 75002 Paris 425 093 069 RCS Paris Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Re: Spark on other parallel filesystems
We run Spark (in Standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find it to work great. It required no modification or special configuration to set this up; as Matei says, we just point Spark to data using the file location. -- Jeremy On Apr 4, 2014, at 8:12 PM, Matei Zaharia matei.zaha...@gmail.com wrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a network-mounted file system though. Matei On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote: All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it’s possible to implement Tachyon on Lustre and get the HDFS interface – just looking at other options. Venkat
Re: Spark on other parallel filesystems
On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia matei.zaha...@gmail.comwrote: As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won't expose data locality info to Spark, the way HDFS does. That may not matter if it's a network-mounted file system though. Is the locality querying mechanism specific to HDFS mode, or is it possible to implement plugins in Spark to query location in other ways on other filesystems? I ask because, glusterfs can expose data location of a file through virtual extended attributes and I would be interested in making Spark exploit that locality when the file location is specified as glusterfs:// (or querying the xattr blindly for file://). How much of a difference does data locality make for Spark use cases anyways (since most of the computation happens in memory)? Any sort of numbers? Thanks! Avati Matei On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy ven...@yarcdata.com wrote: All Are there any drawbacks or technical challenges (or any information, really) related to using Spark directly on a global parallel filesystem like Lustre/GPFS? Any idea of what would be involved in doing a minimal proof of concept? Is it just possible to run Spark unmodified (without the HDFS substrate) for a start, or will that not work at all? I do know that it's possible to implement Tachyon on Lustre and get the HDFS interface - just looking at other options. Venkat
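On the locality question, one way to see what locality information Spark actually has for a given input is to inspect the RDD's preferred locations; for a file:// URL on a network mount the list is typically empty. A small sketch with a made-up path:

    val rdd = sc.textFile("file:///mnt/gluster/volume/data.txt")

    // Preferred locations come from the underlying Hadoop InputFormat's splits;
    // an empty Seq means the scheduler has no locality preference for that partition.
    rdd.partitions.take(3).foreach { p =>
      println(p.index + " -> " + rdd.preferredLocations(p))
    }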