Re: Could not compute split, block not found
Are you by any chance using only memory in the storage level of the input streams? TD On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' < t. In fact, for *temporary* peaks in your streaming data, I think the way Spark handles it is very nice, in particular since 1) it does not mix up the order in which items arrived in the stream, so items from a later window will always be processed later, and 2) because an increase in data will not be punished with high load and unresponsive systems, but with disk space consumption instead. However, if all of your windows require t' > t processing time (and it's not because you are waiting, but because you actually do some computation), then you are in bad luck, because if you start processing the next window while the previous one is still being processed, you have less resources for each and processing will take even longer. However, if you are only waiting (e.g., for network I/O), then maybe you can employ some asynchronous solution where your tasks return immediately and deliver their result via a callback later? Tobias On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Your suggestion is very helpful. I will definitely investigate it. Just curious. Suppose the batch size is t seconds. In practice, does Spark always require the program to finish processing the data of t seconds within t seconds' processing time? Can Spark begin to consume the new batch before finishing processing the previous batch? If Spark can do them together, it may save the processing time and solve the problem of data piling up. Thanks! Bill On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote: If your batch size is one minute and it takes more than one minute to process, then I guess that's what causes your problem. The processing of the second batch will not start until the processing of the first is finished, which leads to more and more data being stored and waiting for processing; check the attached graph for a visualization of what I think may happen. Can you maybe do something hacky like throwing away a part of the data so that processing time gets below one minute, then check whether you still get that error? Tobias On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Thanks for your help. I think in my case, the batch size is 1 minute. However, it takes my program more than 1 minute to process 1 minute's data. I am not sure whether it is because the unprocessed data pile up. Do you have any suggestion on how to check it and solve it? Thanks! Bill On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, were you able to process all information in time, or did maybe some unprocessed data pile up? I think when I saw this once, the reason seemed to be that I had received more data than would fit in memory, while waiting for processing, so old data was deleted. When it was time to process that data, it didn't exist any more. Is that a possible reason in your case? Tobias On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi, I am running a spark streaming job with 1 minute as the batch size.
It ran around 84 minutes and was killed because of the exception with the following information: java.lang.Exception: Could not compute split, block input-0-1403893740400 not found Before it was killed, it was able to correctly generate output for each batch. Any help on this will be greatly appreciated. Bill
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
Hi, I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077 LISTEN (after starting the master). I tried to execute the following script from the slaves manually, but it ends up with the same exception and log. This script internally executes the java command. /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077 In this case netstat is not showing any connection established to master:7077. When we manually execute the java command, the connection is getting established to the master. Thanks Regards, Meethu M On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you sure you have this ip 192.168.125.174 bound for that machine? (netstat -na | grep 192.168.125.174) Thanks Best Regards On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I reinstalled spark, rebooted the system, but still I am not able to start the workers. It's throwing the following exception: Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I suspect the problem is with 192.168.125.174:0. Even though the command contains master:7077, why is it showing 0 in the log? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Can somebody tell me a solution? Thanks Regards, Meethu M On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, yes, I tried setting another PORT also, but the same problem. master is set in /etc/hosts Thanks Regards, Meethu M On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote: That's strange, did you try setting the master port to something else (use SPARK_MASTER_PORT)? Also you said you are able to start it from the java command line: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 What is the master ip specified here? Is it like you have an entry for master in /etc/hosts? Thanks Best Regards On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, I am running it in a LAN itself. The IP of the master is given correctly. Thanks Regards, Meethu M On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Why is it binding to port 0? 192.168.125.174:0 :/ Check the ip address of that master machine (ifconfig); it looks like the ip address has been changed (hoping you are running these machines on a LAN) Thanks Best Regards On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, My Spark (Standalone mode) was running fine till yesterday. But now I am getting the following exception when I am running start-slaves.sh or start-all.sh: slave3: failed to launch org.apache.spark.deploy.worker.Worker: slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) slave3: at java.lang.Thread.run(Thread.java:662) The log files have the following lines.
14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser 14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser) 14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started 14/06/27 11:06:30 INFO Remoting: Starting remoting Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address ... I saw the same error reported before and have tried the following solutions. Set the variable SPARK_LOCAL_IP ,Changed the SPARK_MASTER_PORT to a different number..But nothing is working. When I try to start the worker from the respective machines using the following java command,its running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 Somebody please give a solution Thanks Regards, Meethu M
Re: Serialization of objects
If you want to stick with Java serialization and need to serialize a non-Serializable object, your best choices are probably to either subclass it with a Serializable one or wrap it in a class of your own which implements its own writeObject/readObject methods (see here: http://stackoverflow.com/questions/6163872/how-to-serialize-a-non-serializable-in-java ) Otherwise you can use Kryo to register custom serializers for other people's objects. On Mon, Jun 30, 2014 at 1:52 PM, Sameer Tilak ssti...@live.com wrote: Hi everyone, I was able to solve this issue. For now I changed the library code and added the following to the class com.wcohen.ss.BasicStringWrapper: public class BasicStringWrapper implements Serializable However, I am still curious to know how to get around the issue when you don't have access to the code and you are using a 3rd party jar. -- From: ssti...@live.com To: u...@spark.incubator.apache.org Subject: Serialization of objects Date: Thu, 26 Jun 2014 09:30:31 -0700 Hi everyone, Aaron, thanks for your help so far. I am trying to serialize objects that I instantiate from a 3rd party library, namely instances of com.wcohen.ss.Jaccard and com.wcohen.ss.BasicStringWrapper. However, I am having problems with serialization. I am (at least trying to) use Kryo for serialization, but I am still facing the serialization issue. I get org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: com.wcohen.ss.BasicStringWrapper Any help with this will be great. Scala code: package approxstrmatch import com.wcohen.ss.BasicStringWrapper; import com.wcohen.ss.Jaccard; import java.util.Iterator; import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.rdd; import org.apache.spark.rdd.RDD; import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator class MyRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.register(classOf[approxstrmatch.JaccardScore]) kryo.register(classOf[com.wcohen.ss.BasicStringWrapper]) kryo.register(classOf[com.wcohen.ss.Jaccard]) } } class JaccardScore { val mjc = new Jaccard() with Serializable val conf = new SparkConf().setMaster("spark://pzxnvm2018:7077").setAppName("ApproxStrMatch") conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator") val sc = new SparkContext(conf) def calculateScoreSecond (sourcerdd: RDD[String], destrdd: RDD[String]) { val jc_ = this.mjc var i: Int = 0 for (sentence <- sourcerdd.toLocalIterator) { val str1 = new BasicStringWrapper(sentence) var scorevector = destrdd.map(x => jc_.score(str1, new BasicStringWrapper(x))) val fileName = new String("/apps/software/scala-approsstrmatch-sentence" + i) scorevector.saveAsTextFile(fileName) i += 1 } } } Here is the script: val distFile = sc.textFile("hdfs://serverip:54310/data/dummy/sample.txt"); val srcFile = sc.textFile("hdfs://serverip:54310/data/dummy/test.txt"); val score = new approxstrmatch.JaccardScore() score.calculateScoreSecond(srcFile, distFile) O/P: 14/06/25 12:32:05 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at textFile at console:12), which has no missing parents 14/06/25 12:32:05 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[3] at textFile at console:12) 14/06/25 12:32:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks 14/06/25 12:32:05 INFO TaskSetManager: Starting task 0.0:0 as TID 0
on executor localhost: localhost (PROCESS_LOCAL) 14/06/25 12:32:05 INFO TaskSetManager: Serialized task 0.0:0 as 1879 bytes in 4 ms 14/06/25 12:32:05 INFO Executor: Running task ID 0 14/06/25 12:32:05 INFO Executor: Fetching http://serverip:47417/jars/approxstrmatch.jar with timestamp 1403724701564 14/06/25 12:32:05 INFO Utils: Fetching http://serverip:47417/jars/approxstrmatch.jar to /tmp/fetchFileTemp8194323811657370518.tmp 14/06/25 12:32:05 INFO Executor: Adding file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/approxstrmatch.jar to class loader 14/06/25 12:32:05 INFO Executor: Fetching http://serverip:47417/jars/secondstring-20140618.jar with timestamp 1403724701562 14/06/25 12:32:05 INFO Utils: Fetching http://serverip:47417/jars/secondstring-20140618.jar to /tmp/fetchFileTemp8711755318201511766.tmp 14/06/25 12:32:06 INFO Executor: Adding file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/secondstring-20140618.jar to class loader 14/06/25 12:32:06 INFO BlockManager: Found block broadcast_1 locally 14/06/25 12:32:06 INFO HadoopRDD: Input split: hdfs://serverip:54310/data/dummy/test.txt:0+140 14/06/25 12:32:06 INFO Executor: Serialized size of result for 0 is 717
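A minimal sketch of the wrapper idea suggested at the top of this thread, assuming the non-serializable object (here com.wcohen.ss.Jaccard) can be cheaply re-created on each executor; the holder class and method names below are hypothetical, not part of the original code:

```scala
import com.wcohen.ss.{BasicStringWrapper, Jaccard}
import org.apache.spark.rdd.RDD

// Only this small Serializable holder is shipped with the task; the
// non-serializable Jaccard instance is re-created lazily on each executor JVM.
class JaccardHolder extends Serializable {
  @transient lazy val jc: Jaccard = new Jaccard()
}

def scoreAgainst(holder: JaccardHolder, src: String, destrdd: RDD[String]): RDD[Double] =
  // Both BasicStringWrapper instances are constructed inside the closure,
  // so nothing non-serializable is ever captured by the task closure.
  destrdd.map(x => holder.jc.score(new BasicStringWrapper(src), new BasicStringWrapper(x)))
```

This sidesteps serialization entirely rather than making the third-party classes serializable; registering them with Kryo, as in the code above, is the other route.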
build spark assign version number myself?
Hi,all: I'm working to compile spark by executing './make-distribution.sh --hadoop 0.20.205.0 --tgz ', after the completion of the compilation I found that the default version number is 1.1.0-SNAPSHOT i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz, who know how to assign version number myself , for example spark-1.1.0-company-bin-0.20.205.tgz . Thanks, majian
issue with running example code
Hi, I am having an issue running the Scala example code. I have tested and am able to run the Python example code successfully, but when I run the Scala code I get this error: java.lang.ClassCastException: cannot assign instance of org.apache.spark.examples.SparkPi$$anonfun$1 to field org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of org.apache.spark.rdd.MappedRDD I have compiled Spark directly from GitHub and am running with the command: spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar --class org.apache.spark.examples.SparkPi 5 --jars /usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar Any suggestions will be helpful. Thanks, Gurvinder
Questions about disk IOs
Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused that the performance is very close whether the data is cached or not. The program is simple: points = sc.hadoopFile(int, SequenceFileInputFormat.class, …) points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached gradient = new LogisticGradient(); updater = new SquaredL2Updater(); initWeight = Vectors.sparse(size, new int[]{}, new double[]{}) result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections, convergeTol, maxIter, regParam, initWeight); I have 13 machines with 16 cpus, 48G RAM each. Spark is running in its cluster mode. Below are some arguments I am using: --executor-memory 10G --num-executors 50 --executor-cores 2 Storage when caching: Cached Partitions 951, Fraction Cached 100%, Size in Memory 215.7GB, Size in Tachyon 0.0B, Size on Disk 1029.7MB The time cost by every aggregate is around 5 minutes with cache enabled. Lots of disk IOs can be seen on the hadoop node. I have the same result with cache disabled. Should caching the data points improve the performance? Should caching decrease the disk IO? Thanks in advance.
Re: build spark assign version number myself?
You can specify a custom name with the --name option. It will still contain 1.1.0-SNAPSHOT, but at least you can specify your company name. If you want to replace SNAPSHOT with your company name, you will have to edit make-distribution.sh and replace the following line: VERSION=$(mvn help:evaluate -Dexpression=project.version 2/dev/null | grep -v INFO | tail -n 1) with something like COMPANYNAME=SoullessMegaCorp VERSION=$(mvn help:evaluate -Dexpression=project.version 2/dev/null | grep -v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME') and so on for other packages that use their own version scheme. On Tue, Jul 1, 2014 at 9:21 AM, majian maj...@nq.com wrote: Hi,all: I'm working to compile spark by executing './make-distribution.sh --hadoop 0.20.205.0 --tgz ', after the completion of the compilation I found that the default version number is 1.1.0-SNAPSHOT i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz, who know how to assign version number myself , for example spark-1.1.0-company-bin-0.20.205.tgz . Thanks, majian
Re: build spark assign version number myself?
Sorry, there's a typo in my previous post, the line should read: VERSION=$(mvn help:evaluate -Dexpression=project.version 2/dev/null | grep -v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME/g') On Tue, Jul 1, 2014 at 10:35 AM, Guillaume Ballet gbal...@gmail.com wrote: You can specify a custom name with the --name option. It will still contain 1.1.0-SNAPSHOT, but at least you can specify your company name. If you want to replace SNAPSHOT with your company name, you will have to edit make-distribution.sh and replace the following line: VERSION=$(mvn help:evaluate -Dexpression=project.version 2/dev/null | grep -v INFO | tail -n 1) with something like COMPANYNAME=SoullessMegaCorp VERSION=$(mvn help:evaluate -Dexpression=project.version 2/dev/null | grep -v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME') and so on for other packages that use their own version scheme. On Tue, Jul 1, 2014 at 9:21 AM, majian maj...@nq.com wrote: Hi,all: I'm working to compile spark by executing './make-distribution.sh --hadoop 0.20.205.0 --tgz ', after the completion of the compilation I found that the default version number is 1.1.0-SNAPSHOT i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz, who know how to assign version number myself , for example spark-1.1.0-company-bin-0.20.205.tgz . Thanks, majian
RSpark installation on Windows
Hi All Can we install RSpark on windows setup of R and use it to access the remote Spark cluster ? Thanks Stuti Awasthi
Spark Streaming question batch size
Hi, The window size in a spark streaming is time based which means we have different number of elements in each window. For example if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval. I am not clear how it will work. Although they start running simultaneously, they might have different number of elements in each time interval. The following is output for two streams which have same number of elements and ran simultaneously. The left most value is the number of elements in each window. If we add the number of elements them, they are same for both streams but we can't compare both streams as they are different in window size and number of windows. Can we somehow make windows based on real time values for both streams? or Can we make windows based on number of elements? (n, (mean, varience, SD)) Stream 1 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025)) (44826,(0.2546925855084064,5042.890184382894,71.0133099100647)) (245466,(0.2857731601728941,5014.411691661449,70.81251084138628)) (154852,(0.21907814309792514,3483.800160602281,59.023725404300606)) (156345,(0.3075668844414613,7449.528181550462,86.31064929399189)) (156603,(0.27785151491351234,5917.809892281489,76.9273026452994)) (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623)) Stream 2 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536)) (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975)) (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792)) (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888)) (269817,(0.16987953223480945,3270.663944782799,57.18971887308766)) (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577)) Regards, Laeeq
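One possible direction, sketched below under the assumption that each element carries its own event timestamp (the Sample case class and its fields are hypothetical): key both streams by an event-time bucket derived from the data itself and join on that key, so the comparison is aligned by the time in the data rather than by arrival time or element count.

```scala
import org.apache.spark.streaming.StreamingContext._   // pair DStream operations
import org.apache.spark.streaming.dstream.DStream

case class Sample(timestampMs: Long, value: Double)    // hypothetical element type

def alignByEventTime(s1: DStream[Sample], s2: DStream[Sample], bucketMs: Long)
    : DStream[(Long, (Iterable[Double], Iterable[Double]))] = {
  // Bucket each element by the event time it carries, not by when it arrived.
  val k1 = s1.map(e => (e.timestampMs / bucketMs, e.value)).groupByKey()
  val k2 = s2.map(e => (e.timestampMs / bucketMs, e.value)).groupByKey()
  // Buckets with the same key cover the same data interval in both streams.
  k1.join(k2)
}
```

Note that a per-batch join only matches buckets whose elements arrive within the same batch interval; buckets that straddle batch boundaries would need groupByKeyAndWindow or updateStateByKey to be accumulated first.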
Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition
Hi, I am trying to run a project which takes data as a DStream and dumps the data in the Shark table after various operations. I am getting the following error : Exception in thread main org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure: java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619) at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Can someone please explain the cause of this error, I am also using a Spark Context with the existing Streaming Context.
Window Size
Hi, The window size in a spark streaming is time based which means we have different number of elements in each window. For example if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval. I am not clear how it will work. Although they start running simultaneously, they might have different number of elements in each time interval. The following is output for two streams which have same number of elements and ran simultaneously. The left most value is the number of elements in each window. If we add the number of elements them, they are same for both streams but we can't compare both streams as they are different in window size and number of windows. Can we somehow make windows based on real time values for both streams? or Can we make windows based on number of elements? (n, (mean, varience, SD)) Stream 1 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025)) (44826,(0.2546925855084064,5042.890184382894,71.0133099100647)) (245466,(0.2857731601728941,5014.411691661449,70.81251084138628)) (154852,(0.21907814309792514,3483.800160602281,59.023725404300606)) (156345,(0.3075668844414613,7449.528181550462,86.31064929399189)) (156603,(0.27785151491351234,5917.809892281489,76.9273026452994)) (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623)) Stream 2 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536)) (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975)) (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792)) (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888)) (269817,(0.16987953223480945,3270.663944782799,57.18971887308766)) (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577)) Regards,Laeeq
java.io.FileNotFoundException: http://IP/broadcast_1
Hi All, We are using a Shark table to dump the data, and we are getting the following error: Exception in thread main org.apache.spark.SparkException: Job aborted: Task 1.0:0 failed 1 times (most recent failure: Exception failure: java.io.FileNotFoundException: http://IP/broadcast_1) We don't know where the error is coming from; can anyone please explain the cause of this error and how to handle it? The spark.cleaner.ttl is set to 4600, which I guess is more than enough to run the application. Spark Version : 0.9.0-incubating Shark : 0.9.0 - SNAPSHOT Scala : 2.10.3 Thank You Honey Joshi Ideata Analytics
Failed to launch Worker
Hi , I am using Spark Standalone mode with one master and 2 slaves.I am not able to start the workers and connect it to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command,its running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 Thanks Regards, Meethu M
Re: Failed to launch Worker
Is this command working?? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/ assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 Thanks Best Regards On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi , I am using Spark Standalone mode with one master and 2 slaves.I am not able to start the workers and connect it to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command,its running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 Thanks Regards, Meethu M
Re: Failed to launch Worker
Yes. Thanks Regards, Meethu M On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this command working?? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 Thanks Best Regards On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi , I am using Spark Standalone mode with one master and 2 slaves.I am not able to start the workers and connect it to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says Exception in thread main org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command,its running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 Thanks Regards, Meethu M
RE: Spark 1.0 and Logistic Regression Python Example
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade the numpy/python for a permanent fix. My current versions of python and numpy are 2.6 and 4.1.9 respectively. Thanks, Sam -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday, July 01, 2014 12:14 AM To: user@spark.apache.org Subject: Re: Spark 1.0 and Logistic Regression Python Example You were using an old version of numpy, 1.4? I think this is fixed in the latest master. Try to replace vec.dot(target) by numpy.dot(vec, target), or use the latest master. -Xiangrui On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote: Hi, I modified the example code for logistic regression to compute the error in classification. Please see below. However the code is failing when it makes a call to: labelsAndPreds.filter(lambda (v, p): v != p).count() with the error message (something related to numpy or dot product): File /opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py, line 65, in predict margin = _dot(x, self._coeff) + self._intercept File /opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py, line 443, in _dot return vec.dot(target) AttributeError: 'numpy.ndarray' object has no attribute 'dot' FYI, I am running the code using spark-submit i.e. ./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py The code is posted below if it will be useful in any way: from math import exp import sys import time from pyspark import SparkContext from pyspark.mllib.classification import LogisticRegressionWithSGD from pyspark.mllib.regression import LabeledPoint from numpy import array # Load and parse the data def parsePoint(line): values = [float(x) for x in line.split(',')] if values[0] == -1: # Convert -1 labels to 0 for MLlib values[0] = 0 return LabeledPoint(values[0], values[1:]) sc = SparkContext(appName=PythonLR) # start timing start = time.time() #start = time.clock() data = sc.textFile(sWAMSpark_train.csv) parsedData = data.map(parsePoint) # Build the model model = LogisticRegressionWithSGD.train(parsedData) #load test data testdata = sc.textFile(sWSpark_test.csv) parsedTestData = testdata.map(parsePoint) # Evaluating the model on test data labelsAndPreds = parsedTestData.map(lambda p: (p.label, model.predict(p.features))) trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count()) print(Training Error = + str(trainErr)) end = time.time() print(Time is = + str(end - start))
Re: Changing log level of spark
We changed the loglevel to DEBUG by replacing every INFO with DEBUG in /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the cluster. There is some DEBUG output visible in both master and worker but nothing really interesting regarding stages or scheduling. Since we expected a little more than that, there could be 2 possibilities: a) there is still some other unknown way to set the loglevel to debug, or b) there is not that much log output to be expected in this direction. I looked for logDebug (the log wrapper in Spark) on GitHub and got 84 results, which makes me doubt there is much else to expect. We actually just want to have a little more insight into the system behavior, especially when using Shark, since we ran into some serious concurrency issues with blocking queries. So much for the background on why this is important to us. On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com wrote: If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: I have a log4j.xml in src/main/resources with <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd"> <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/"> [...] <root> <priority value="warn" /> <appender-ref ref="Console" /> </root> </log4j:configuration> and that is included in the jar I package with `sbt assembly`. That works fine for me, at least on the driver. Tobias On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com wrote: Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging , changing the log level is just a matter of creating a log4j.properties (which is in the classpath of Spark) and changing the log level there for the root logger. I did these steps on every node in the cluster (master and worker nodes). However, after restart there is still no debug output as desired, but only the default info log level.
difference between worker and slave nodes
Can anyone explain to me what the difference is between a worker and a slave? I have one master and two slaves which are connected to each other; by using the jps command I can see the master on the master node and a worker on the slave nodes, but I don't have any worker on my master. I start the workers with this command: /bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 I had thought the slaves would be working as workers. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-worker-and-slave-nodes-tp8578.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Changing log level of spark
One thing we ran into was that there was another log4j.properties earlier in the classpath. For us, it was in our MapR/Hadoop conf. If that is the case, something like the following could help you track it down. The only thing to watch out for is that you might have to walk up the classloader hierarchy. ClassLoader cl = Thread.currentThread().getContextClassLoader(); URL loc = cl.getResource(/log4j.properties); System.out.println(loc); -Suren On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck philiplimb...@gmail.com wrote: We changed the loglevel to DEBUG by replacing every INFO with DEBUG in /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the cluster. There is some DEBUG output visible in both master and worker but nothing really interesting regarding stages or scheduling. Since we expected a little more than that, there could be 2 possibilites: a) There is still some other unknown way to set the loglevel to debug b) There is not that much log output to be expected in this direction, I looked for logDebug (The log wrapper in spark) in github with 84 results, which means that I doubt that there is not much else to expect. We actually just want to have a little more insight into the system behavior especially when using Shark since we ran into some serious concurrency issues with blocking queries. So much for the background why this is important to us. On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com wrote: If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: I have a log4j.xml in src/main/resources with ?xml version=1.0 encoding=UTF-8 ? !DOCTYPE log4j:configuration SYSTEM log4j.dtd log4j:configuration xmlns:log4j=http://jakarta.apache.org/log4j/; [...] root priority value =warn / appender-ref ref=Console / /root /log4j:configuration and that is included in the jar I package with `sbt assembly`. That works fine for me, at least on the driver. Tobias On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com wrote: Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging , changing log-level is just a matter of creating a log4j.properties (which is in the classpath of spark) and changing log level there for the root logger. I did this steps on every node in the cluster (master and worker nodes). However, after restart there is still no debug output as desired, but only the default info log level. -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
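If classpath ordering turns out to be hard to control, a minimal alternative sketch (assuming the stock log4j 1.x that Spark bundles) is to force the level programmatically from the driver, which overrides whichever log4j.properties happens to win:

```scala
import org.apache.log4j.{Level, Logger}

// Call this early in the driver program, before the noisy components start up.
Logger.getRootLogger.setLevel(Level.DEBUG)

// Or target only the scheduler classes instead of turning everything up:
Logger.getLogger("org.apache.spark.scheduler").setLevel(Level.DEBUG)
```

Note this only affects the JVM it runs in (the driver); executors and the standalone master/worker daemons still read their own log4j configuration.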
Re: Changing log level of spark
Are you looking at the driver log? (e.g. Shark?). I see a ton of information in the INFO category on what query is being started, what stage is starting and which executor stuff is sent to. So I'm not sure if you're saying you see all that and you need more, or that you're not seeing this type of information. I cannot speak to the ec2 setup, just pointing out that under 0.9.1 I see quite a bit of scheduling information in the driver log. On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck philiplimb...@gmail.com wrote: We changed the loglevel to DEBUG by replacing every INFO with DEBUG in /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the cluster. There is some DEBUG output visible in both master and worker but nothing really interesting regarding stages or scheduling. Since we expected a little more than that, there could be 2 possibilites: a) There is still some other unknown way to set the loglevel to debug b) There is not that much log output to be expected in this direction, I looked for logDebug (The log wrapper in spark) in github with 84 results, which means that I doubt that there is not much else to expect. We actually just want to have a little more insight into the system behavior especially when using Shark since we ran into some serious concurrency issues with blocking queries. So much for the background why this is important to us. On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com wrote: If you're using the spark-ec2 scripts, you may have to change /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that is added to the classpath before Spark's own conf. On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote: I have a log4j.xml in src/main/resources with ?xml version=1.0 encoding=UTF-8 ? !DOCTYPE log4j:configuration SYSTEM log4j.dtd log4j:configuration xmlns:log4j=http://jakarta.apache.org/log4j/; [...] root priority value =warn / appender-ref ref=Console / /root /log4j:configuration and that is included in the jar I package with `sbt assembly`. That works fine for me, at least on the driver. Tobias On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com wrote: Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging, changing log-level is just a matter of creating a log4j.properties (which is in the classpath of spark) and changing log level there for the root logger. I did this steps on every node in the cluster (master and worker nodes). However, after restart there is still no debug output as desired, but only the default info log level.
Spark 1.0: Unable to Read LZO Compressed File
Dear Spark Users: Spark 1.0 has been installed as Standalone - but it can't read any compressed (CMX/Snappy) or Sequence files residing on HDFS (it can read uncompressed files from HDFS). The key notable message is: Unable to load native-hadoop library. Other related messages are - Caused by: java.lang.IllegalStateException: Cannot load com.ibm.biginsights.compress.CmxDecompressor without native library! at com.ibm.biginsights.compress.CmxDecompressor.<clinit>(CmxDecompressor.java:65) Here is the core-site.xml's key part: <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec</value> </property> Here is the spark-env.sh: export SPARK_WORKER_CORES=4 export SPARK_WORKER_MEMORY=10g export SCALA_HOME=/opt/spark/scala-2.11.1 export JAVA_HOME=/opt/spark/jdk1.7.0_55 export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2 export ADD_JARS=/opt/IHC/lib/compression.jar export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/ export SPARK_MASTER_WEBUI_PORT=1080 export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM-branded splittable LZO-based compression codec. Any help to resolve the issue is appreciated. Thanks, Nasir
Re: Spark Streaming question batch size
Are you saying that both streams come in at the same rate and you have the same batch interval but the batch size ends up different? i.e. two datapoints both arriving at X seconds after streaming starts end up in two different batches? How do you define real time values for both streams? I am trying to do something similar to you, I think -- but I'm not clear on what your notion of time is. My reading of your example above is that the streams just pump data in at different rates -- first one got 7462 points in the first batch interval, whereas stream2 saw 10493 On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, The window size in a spark streaming is time based which means we have different number of elements in each window. For example if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval. I am not clear how it will work. Although they start running simultaneously, they might have different number of elements in each time interval. The following is output for two streams which have same number of elements and ran simultaneously. The left most value is the number of elements in each window. If we add the number of elements them, they are same for both streams but we can't compare both streams as they are different in window size and number of windows. Can we somehow make windows based on real time values for both streams? or Can we make windows based on number of elements? (n, (mean, varience, SD)) Stream 1 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025)) (44826,(0.2546925855084064,5042.890184382894,71.0133099100647)) (245466,(0.2857731601728941,5014.411691661449,70.81251084138628)) (154852,(0.21907814309792514,3483.800160602281,59.023725404300606)) (156345,(0.3075668844414613,7449.528181550462,86.31064929399189)) (156603,(0.27785151491351234,5917.809892281489,76.9273026452994)) (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623)) Stream 2 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536)) (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975)) (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792)) (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888)) (269817,(0.16987953223480945,3270.663944782799,57.18971887308766)) (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577)) Regards, Laeeq
Re: Question about VD and ED
Hi Bin, VD and ED are type parameters (bounded by ClassTags); you can treat them as placeholders, like template<T> in C++ (not a 100% exact analogy). You do not need to convert graph[String, Double] to Graph[VD,ED]. Checking ClassTag's definition in Scala could help. Best, On Jul 1, 2014, at 4:49 AM, Bin WU bw...@connect.ust.hk wrote: Hi all, I am a newbie to graphx. I am currently having trouble understanding the types VD and ED. I notice that VD and ED are widely used in the graphx implementation, but I don't know why and how I am supposed to use them. Specifically, say I have constructed a graph graph : Graph[String, Double]; when, why, and how should I transform it into the type Graph[VD, ED]? Also, I don't know what package I should import. I have imported: import org.apache.spark.graphx._ import org.apache.spark.rdd.RDD But the two types VD and ED are still not found. Sorry for the stupid question. Thanks in advance! Best, Ben
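A small sketch of how VD and ED look from the caller's side: they are fixed to concrete types the moment you build a graph, and you only need ClassTag bounds yourself when you write a method that is generic over any graph. The vertex/edge data below are made-up examples:

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// A concrete graph: here VD = String and ED = Double, nothing to convert.
def build(sc: SparkContext): Graph[String, Double] = {
  val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))
  val edges: RDD[Edge[Double]] = sc.parallelize(Seq(Edge(1L, 2L, 0.5)))
  Graph(vertices, edges)
}

// A method generic over any vertex/edge types carries the ClassTag bounds,
// mirroring how the GraphX API itself declares them.
def degreesOf[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): VertexRDD[Int] =
  g.degrees
```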
Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector
You can use either bin/run-example or bin/spark-summit to run example code. scalac -d classes/ SparkKMeans.scala doesn't recognize Spark classpath. There are examples in the official doc: http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here -Xiangrui On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote: Hello, I have installed spark-1.0.0 with scala2.10.3. I have built spark with sbt/sbt assembly and added /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar to my CLASSPATH variable. Then I went here ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a new directory classes and compiled SparkKMeans.scala with scalac -d classes/ SparkKMeans.scala Then I navigated to classes (I commented this line in the scala file : package org.apache.spark.examples ) and tried to run it with java -cp . SparkKMeans and I get the following error: Exception in thread main java.lang.NoClassDefFoundError: breeze/linalg/Vector at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 6 more The jar under /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar contains the breeze/linalg/Vector* path, I even tried to unpack it and put it in CLASSPATH to it does not seem to pick it up I am currently running java 1.8 java version 1.8.0_05 Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode) What I am doing wrong ?
Re: Questions about disk IOs
Try to reduce number of partitions to match the number of cores. We will add treeAggregate to reduce the communication cost. PR: https://github.com/apache/spark/pull/1110 -Xiangrui On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote: Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G. The weight size is around 1,300,000. I am quite confused that the performance is very close whether the data is cached or not. The program is simple: points = sc.hadoopFIle(int, SequenceFileInputFormat.class …..) points.persist(StorageLevel.Memory_AND_DISK_SER()) // comment it if not cached gradient = new LogisticGrandient(); updater = new SquaredL2Updater(); initWeight = Vectors.sparse(size, new int[]{}, new double[]{}) result = LBFGS.runLBFGS(points.rdd(), grandaunt, updater, numCorrections, convergeTol, maxIter, regParam, initWeight); I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its cluster mode. Below are some arguments I am using: —executor-memory 10G —num-executors 50 —executor-cores 2 Storage Using: When caching: Cached Partitions 951 Fraction Cached 100% Size in Memory 215.7GB Size in Tachyon 0.0B Size on Disk 1029.7MB The time cost by every aggregate is around 5 minutes with cache enabled. Lots of disk IOs can be seen on the hadoop node. I have the same result with cache disabled. Should data points caching improve the performance? Should caching decrease the disk IO? Thanks in advance.
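A rough sketch of the suggestion above, written in Scala and assuming `points` is the cached training RDD from the original post. With --num-executors 50 and --executor-cores 2 there are roughly 100 cores in total, far fewer than the 951 cached partitions, so cutting the partition count down toward the core count reduces per-aggregate task and shuffle overhead:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical: set this to the cores actually granted, e.g. 50 executors * 2 cores.
val totalCores = 100

val compact = points.coalesce(totalCores)
compact.persist(StorageLevel.MEMORY_AND_DISK_SER)
compact.count()  // materialize the cache once before running LBFGS
```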
Spark Summit 2014 Day 2 Video Streams?
I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Spark Summit 2014 Day 2 Video Streams?
General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014 Track A: http://www.ustream.tv/channel/track-a1 Track B: http://www.ustream.tv/channel/track-b1 Track C: http://www.ustream.tv/channel/track-c1 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com wrote: I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Could not compute split, block not found
Hi Tobias, Your explanation makes a lot of sense. Actually, I tried to use partial data on the same program yesterday. It has been up for around 24 hours and is still running correctly. Thanks! Bill On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' t. In fact, for *temporary* peaks in your streaming data, I think the way Spark handles it is very nice, in particular since 1) it does not mix up the order in which items arrived in the stream, so items from a later window will always be processed later, and 2) because an increase in data will not be punished with high load and unresponsive systems, but with disk space consumption instead. However, if all of your windows require t' t processing time (and it's not because you are waiting, but because you actually do some computation), then you are in bad luck, because if you start processing the next window while the previous one is still processed, you have less resources for each and processing will take even longer. However, if you are only waiting (e.g., for network I/O), then maybe you can employ some asynchronous solution where your tasks return immediately and deliver their result via a callback later? Tobias On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Your suggestion is very helpful. I will definitely investigate it. Just curious. Suppose the batch size is t seconds. In practice, does Spark always require the program to finish processing the data of t seconds within t seconds' processing time? Can Spark begin to consume the new batch before finishing processing the next batch? If Spark can do them together, it may save the processing time and solve the problem of data piling up. Thanks! Bill On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp wrote: If your batch size is one minute and it takes more than one minute to process, then I guess that's what causes your problem. The processing of the second batch will not start after the processing of the first is finished, which leads to more and more data being stored and waiting for processing; check the attached graph for a visualization of what I think may happen. Can you maybe do something hacky like throwing away a part of the data so that processing time gets below one minute, then check whether you still get that error? Tobias On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Tobias, Thanks for your help. I think in my case, the batch size is 1 minute. However, it takes my program more than 1 minute to process 1 minute's data. I am not sure whether it is because the unprocessed data pile up. Do you have an suggestion on how to check it and solve it? Thanks! Bill On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, were you able to process all information in time, or did maybe some unprocessed data pile up? I think when I saw this once, the reason seemed to be that I had received more data than would fit in memory, while waiting for processing, so old data was deleted. When it was time to process that data, it didn't exist any more. Is that a possible reason in your case? Tobias On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi, I am running a spark streaming job with 1 minute as the batch size. 
It ran around 84 minutes and was killed because of the exception with the following information: java.lang.Exception: Could not compute split, block input-0-1403893740400 not found Before it was killed, it was able to correctly generate output for each batch. Any help on this will be greatly appreciated. Bill
Re: Could not compute split, block not found
Hi Tathagata, Yes. The input stream is from Kafka and my program reads the data, keeps all the data in memory, processes the data, and generates the output. Bill On Mon, Jun 30, 2014 at 11:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you by any chance using only memory in the storage level of the input streams? TD
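For reference, a minimal Scala sketch (Spark 1.0 streaming API; the ZooKeeper host, consumer group and topic names are made up) of setting the receiver's storage level explicitly when the Kafka stream is created, so that received blocks can spill to disk instead of being dropped while batches queue up:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-ingest")
    val ssc = new StreamingContext(conf, Seconds(60))  // 1-minute batches

    // MEMORY_AND_DISK_SER lets received blocks spill to disk under memory
    // pressure instead of being evicted before they are processed.
    val stream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1),
      StorageLevel.MEMORY_AND_DISK_SER)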
spark streaming rate limiting from kafka
In my use case, if I need to stop Spark Streaming for a while, data accumulates a lot on the Kafka topic-partitions. After I restart the Spark Streaming job, the worker's heap goes out of memory on the fetch of the 1st batch. I am wondering:
* Is there a way to throttle reading from Kafka in Spark Streaming jobs?
* Is there a way to control how far the Kafka DStream can read on a topic-partition (via offset, for example)? By setting this to a small number, it would force the DStream to read less data initially.
* Is there a way to limit the consumption rate on the Kafka side? (This one is not actually about Spark Streaming and doesn't seem to be a question for this group, but I am raising it anyway.)
I have looked at the code example below but it doesn't seem to be supported. KafkaUtils.createStream ... Thanks, All -- Chen Song
Re: Improving Spark multithreaded performance?
This all seems pretty hackish and a lot of trouble to get around limitations in mllib. The big limitation is that right now, the optimization algorithms work on one large dataset at a time. We need a second set of methods to work on a large number of medium-sized datasets. I've started to code a new set of optimization methods to add into mllib. I've started with GroupedGradientDescent ( https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala ) GroupedGradientDescent is based on GradientDescent, but instead it takes RDD[(Int, (Double, Vector))] as its data input rather than RDD[(Double, Vector)]. The Int serves as a key to mark which elements should be grouped together. This lets you multiplex several dataset optimizations into the same RDD. I think I've gotten the GroupedGradientDescent to work correctly. I need to go up the stack and start adding methods like SVMWithSGD.trainGroup. Does anybody have any thoughts on this? Kyle On Fri, Jun 27, 2014 at 6:36 PM, Xiangrui Meng men...@gmail.com wrote: The RDD is cached in only one or two workers. All other executors need to fetch its content via network. Since the dataset is not huge, could you try this? val features: Array[Vector] = ... val featuresBc = sc.broadcast(features) // parallel loops val labels: Array[Double] = ... val rdd = sc.parallelize(0 until 1, 1).flatMap(i => featuresBc.value.view.zip(labels)) val model = SVMWithSGD.train(rdd) models(i) = model Using BT broadcast factory would improve the performance of broadcasting. Best, Xiangrui On Fri, Jun 27, 2014 at 3:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: 1) I'm using the static SVMWithSGD.train, with no options. 2) I have about 20,000 features (~5000 samples) that are being attached and trained against 14,000 different sets of labels (i.e. I'll be doing 14,000 different training runs against the same sets of features trying to figure out which labels can be learned), and I would also like to do cross fold validation. The driver doesn't seem to be using too much memory. I left it as -Xmx8g and it never complained. Kyle On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng men...@gmail.com wrote: Hi Kyle, A few questions: 1) Did you use `setIntercept(true)`? 2) How many features? I'm a little worried about the driver's load because the final aggregation and weights update happen on the driver. Did you check the driver's memory usage as well? Best, Xiangrui On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: As far as I can tell there is no data to broadcast (unless there is something internal to mllib that needs to be broadcast). I've coalesced the input RDDs to keep the number of partitions limited. When running, I've tried to get up to 500 concurrent stages, and I've coalesced the RDDs down to 2 partitions, so about 1000 tasks. Despite having over 500 threads in the threadpool working on mllib tasks, the total CPU usage never really goes above 150%. I've tried increasing 'spark.akka.threads' but that doesn't seem to do anything. My one thought is that, because I'm using MLUtils.kFold to generate the RDDs, I have so many tasks working off RDDs that are permutations of the original RDDs, and maybe that is creating some sort of dependency bottleneck. Kyle On Thu, Jun 26, 2014 at 6:35 PM, Aaron Davidson ilike...@gmail.com wrote: I don't have specific solutions for you, but the general things to try are: - Decrease task size by broadcasting any non-trivial objects.
- Increase duration of tasks by making them less fine-grained. How many tasks are you sending? I've seen in the past something like 25 seconds for ~10k total medium-sized tasks. On Thu, Jun 26, 2014 at 12:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I'm working to set up a calculation that involves calling mllib's SVMWithSGD.train several thousand times on different permutations of the data. I'm trying to run the separate jobs using a threadpool to dispatch the different requests to a spark context connected to a Mesos cluster, using coarse scheduling, and a max of 2000 cores on Spark 1.0. Total utilization of the system is terrible. Most of the 'aggregate at GradientDescent.scala:178' stages (where mllib spends most of its time) take about 3 seconds, but have ~25 seconds of scheduler delay time. What kind of things can I do to improve this? Kyle
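Spelling out Xiangrui's broadcast suggestion a bit, here is one way the loop could be written against the MLlib 1.0 API; the label sets, the 100 iterations and the 2 partitions are placeholders, and the zipped pairs are wrapped in LabeledPoint since that is what SVMWithSGD.train expects. The outer map over label sets could just as well be dispatched from a thread pool, as in Kyle's setup.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint

    def trainAll(sc: SparkContext,
                 features: Array[Vector],
                 labelSets: Seq[Array[Double]]): Seq[SVMModel] = {
      // Broadcast the shared features once so every training run reads a local
      // copy instead of re-fetching a cached RDD over the network.
      val featuresBc = sc.broadcast(features)
      labelSets.map { labels =>
        val data = sc.parallelize(featuresBc.value.zip(labels).toSeq, 2)
          .map { case (v, y) => LabeledPoint(y, v) }
        SVMWithSGD.train(data, 100)  // placeholder iteration count
      }
    }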
Re: spark streaming rate limiting from kafka
Maybe reducing the batch duration would help :\ 2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com: In my use case, if I need to stop Spark Streaming for a while, data accumulates a lot on the Kafka topic-partitions. After I restart the Spark Streaming job, the worker's heap goes out of memory on the fetch of the 1st batch.
Re: Re: spark table to hive table
Michael - Does Spark SQL support rlike and like yet? I am running into that same error with a basic select * from table where field like '%foo%' using the hql() function. Thanks On Wed, May 28, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung itsjb.j...@samsung.com wrote: I already tried HiveContext as well as SqlContext. But it seems that Spark's HiveContext is not completely the same as Apache Hive. For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST LIMIT 10' works fine in Apache Hive. Spark SQL doesn't support window functions yet (SPARK-1442 https://issues.apache.org/jira/browse/SPARK-1442). Sorry for the non-obvious error message!
Re: Spark Summit 2014 Day 2 Video Streams?
Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote: General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014 Track A: http://www.ustream.tv/channel/track-a1 Track B: http://www.ustream.tv/channel/track-b1 Track C: http://www.ustream.tv/channel/track-c1 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com wrote: I attended yesterday on ustream.tv, but can't find the links to today's streams anywhere. help! -- Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
Re: Failed to launch Worker
Where are you running the spark-class version? Hopefully also on the workers. If you're trying to centrally start/stop all workers, you can add a slaves file to the spark conf/ directory which is just a list of your hosts, one per line. Then you can just use ./sbin/start-slaves.sh to start the worker on all of your machines. Note that this is already set up correctly if you're using the spark-ec2 scripts. On Tue, Jul 1, 2014 at 5:53 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Yes. Thanks Regards, Meethu M On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Is this command working?? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 Thanks Best Regards On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am using Spark Standalone mode with one master and 2 slaves. I am not able to start the workers and connect them to the master using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077 The log says Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to bind to: master/x.x.x.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address When I try to start the worker from the slaves using the following java command, it's running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 Thanks Regards, Meethu M
Re: Spark Streaming question batch size
Hi Yana, Yes, that is what I am saying. I need both streams to be at the same pace. I do have timestamps for each datapoint. There is a way suggested by Tathagata Das in an earlier post where you have a bigger window than required and you fetch your required data from that window based on your timestamps. I was just looking to see if there are other, cleaner ways to do it. Regards Laeeq On Tuesday, July 1, 2014 4:23 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Are you saying that both streams come in at the same rate and you have the same batch interval but the batch size ends up different? i.e. two datapoints both arriving at X seconds after streaming starts end up in two different batches? How do you define real time values for both streams? I am trying to do something similar to you, I think -- but I'm not clear on what your notion of time is. My reading of your example above is that the streams just pump data in at different rates -- the first one got 7462 points in the first batch interval, whereas stream2 saw 10493 On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: Hi, The window size in Spark Streaming is time-based, which means we have a different number of elements in each window. For example, if you have two streams (might be more) which are related to each other and you want to compare them in a specific time interval, I am not clear how it will work. Although they start running simultaneously, they might have a different number of elements in each time interval. The following is the output for two streams which have the same number of elements and ran simultaneously. The leftmost value is the number of elements in each window. If we add up the number of elements, they are the same for both streams, but we can't compare both streams as they are different in window size and number of windows. Can we somehow make windows based on real time values for both streams? Or can we make windows based on the number of elements? (n, (mean, variance, SD))
Stream 1
(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))
Stream 2
(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))
Regards, Laeeq
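For what it's worth, a rough Scala sketch of the oversized-window-plus-timestamp approach mentioned above; the Point record, the 2-minute window and the per-minute bucketing are made-up placeholders, and it assumes each element carries its own event timestamp:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical record type; ts is the event timestamp in epoch millis.
    case class Point(ts: Long, value: Double)

    // Take a 2-minute window every minute, then keep only the points whose
    // event timestamps fall inside the previous full minute. Applying this to
    // both streams yields comparable per-minute slices regardless of how many
    // elements each stream happened to receive.
    def lastFullMinute(stream: DStream[Point]): DStream[Point] =
      stream.window(Seconds(120), Seconds(60)).transform { (rdd, time) =>
        val end = (time.milliseconds / 60000) * 60000
        val start = end - 60000
        rdd.filter(p => p.ts >= start && p.ts < end)
      }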
why is toBreeze private everywhere in mllib?
It's kind of handy to be able to convert stuff to Breeze... is there some other way I am supposed to access that functionality?
[ANNOUNCE] Flambo - A Clojure DSL for Apache Spark
Yieldbot is pleased to announce the release of Flambo, our Clojure DSL for Apache Spark. Flambo allows one to write Spark applications in pure Clojure as an alternative to the Scala, Java and Python APIs currently available in Spark. We have already written a substantial amount of internal code in Clojure using Flambo and we are excited to hear and see what others will come up with. As ever, Pull Requests and/or Issues on GitHub are greatly appreciated! You can find links to source, api docs and literate source code here: http://bit.ly/V8FmzC -- @sorenmacbeth
Re: why is toBreeze private everywhere in mllib?
We were not ready to expose it as a public API in v1.0. Both breeze and MLlib are in rapid development. It would be possible to expose it as a developer API in v1.1. For now, it should be easy to define a toBreeze method in your own project. -Xiangrui On Tue, Jul 1, 2014 at 12:17 PM, Koert Kuipers ko...@tresata.com wrote: its kind of handy to be able to convert stuff to breeze... is there some other way i am supposed to access that functionality?
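In case it helps while the method stays private, a small sketch of such a toBreeze helper defined in user code, mirroring what MLlib does internally (the implicit class just needs to live somewhere it can be imported, e.g. a package object):

    import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

    implicit class VectorOps(v: Vector) {
      // Convert an MLlib vector to the corresponding Breeze vector.
      def toBreeze: BV[Double] = v match {
        case dv: DenseVector  => new BDV[Double](dv.values)
        case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
      }
    }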
Re: Spark SQL : Join throws exception
Seems it is a bug. I have opened https://issues.apache.org/jira/browse/SPARK-2339 to track it. Thank you for reporting it. Yin On Tue, Jul 1, 2014 at 12:06 PM, Subacini B subac...@gmail.com wrote: Hi All, Running this join query sql(SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE A.status=1).collect().foreach(println) throws Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:3 failed 4 times, most recent failure: Exception failure in TID 12 on host X.X.X.X: *org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No function to evaluate expression. type: UnresolvedAttribute, tree: 'A.status* org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59) org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147) org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100) org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52) org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52) scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137) org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) java.lang.Thread.run(Thread.java:695) Driver stacktrace: Can someone help me. Thanks in advance.
spark-submit script and spark.files.userClassPathFirst
Hi, I'm trying to get rid of an error (NoSuchMethodError) while using Amazon's s3 client on Spark. I'm using the Spark Submit script to run my code. Reading about my options and other threads, it seemed the most logical way would be to make sure my jar is loaded first. Spark submit on debug shows the same: spark.files.userClassPathFirst - true However, I can't seem to get rid of the error at runtime. The issue I believe is happening because Spark has the older httpclient library in its classpath. I'm using: <dependency> <groupId>org.apache.httpcomponents</groupId> <artifactId>httpclient</artifactId> <version>4.3</version> </dependency> Any clues what might be happening? Stack trace below: Exception in thread "main" java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:138) at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:112) at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:101) at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:87) at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:95) at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26) at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97) at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:158) at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119) at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103) at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:357) at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:339) at com.evocalize.rickshaw.commons.s3.S3Util.getS3Client(S3Util.java:35) at com.evocalize.rickshaw.commons.s3.S3Util.putFileFromLocalFS(S3Util.java:40) at com.evocalize.rickshaw.spark.actions.PackageFilesFunction.call(PackageFilesFunction.java:48) at com.evocalize.rickshaw.spark.applications.GenerateSEOContent.main(GenerateSEOContent.java:76) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-script-and-spark-files-userClassPathFirst-tp8604.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
slf4j multiple bindings
Hi all, I have an issue with multiple slf4j bindings. My program was running correctly. I just added a new dependency, kryo. And when I submitted a job, the job was killed because of the following error message: SLF4J: Class path contains multiple SLF4J bindings. The log said there were three slf4j bindings: spark-assembly-0.9.1-hadoop2.3.0.jar, the hadoop lib, and my own jar file. However, I did not explicitly add slf4j in my pom.xml file. I added exclusions in the dependency of kryo but it did not work. Does anyone have an idea how to fix this issue? Thanks! Regards, Bill
Lost TID: Loss was due to fetch failure from BlockManagerId
I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3 worker). Our app is fetching data from Cassandra and doing a basic filter, map, and countByKey on that data. I have run into a strange problem. Even if the number of rows in Cassandra is just 1M, the Spark job seems to go into an infinite loop and runs for hours. With a small amount of data (less than 100 rows), the job does finish, but takes almost 30-40 seconds and we frequently see the messages shown below. If we run the same application on a single node Spark (--master local[4]), then we don't see these warnings and the task finishes in less than 6-7 seconds. Any idea what could be the cause for these problems when we run our application on a standalone 4-node spark cluster?
14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from
Re: multiple passes in mapPartitions
Also, multiple calls to mapPartitions() will be pipelined by the Spark execution engine into a single stage, so the overhead is minimal. On Fri, Jun 13, 2014 at 9:28 PM, zhen z...@latrobe.edu.au wrote: Thank you for your suggestion. We will try it out and see how it performs. We think the single call to mapPartitions will be faster but we could be wrong. It would be nice to have a clone method on the iterator. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Lost TID: Loss was due to fetch failure from BlockManagerId
A lot of things can get funny when you run distributed as opposed to local -- e.g. some jar not making it over. Do you see anything of interest in the log on the executor machines -- I'm guessing 192.168.222.152/192.168.222.164. From here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala seems like the warning message is logged after the task fails -- but I wonder if you might see something more useful as to why it failed to begin with. As an example we've had cases in Hdfs where a small example would work, but on a larger example we'd hit a bad file. But the executor log is usually pretty explicit as to what happened... On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller moham...@glassbeam.com wrote: I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3 worker). Our app is fetching data from Cassandra and doing a basic filter, map, and countByKey on that data. I have run into a strange problem. Even if the number of rows in Cassandra is just 1M, the Spark job seems to go into an infinite loop and runs for hours.
Re: Fw: How Spark Choose Worker Nodes for respective HDFS block
Yes, Spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL) where possible, just like MapReduce. It's a best practice to co-locate your Spark Workers on the same nodes as your HDFS DataNodes for just this reason. This is achieved through the RDD.preferredLocations() interface method: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD On a related note, you can configure spark.locality.wait as the number of millis to wait before falling back to a less-local data node (RACK_LOCAL): http://spark.apache.org/docs/latest/configuration.html -chris On Fri, Jun 13, 2014 at 11:06 PM, anishs...@yahoo.co.in anishs...@yahoo.co.in wrote: Hi All Is there any communication between the Spark MASTER node and the Hadoop NameNode while distributing work to WORKER nodes, like we have in MapReduce? Please suggest. TIA -- Anish Sneh Experience is the best teacher. http://in.linkedin.com/in/anishsneh -- From: anishs...@yahoo.co.in anishs...@yahoo.co.in; To: u...@spark.incubator.apache.org u...@spark.incubator.apache.org; Subject: How Spark Choose Worker Nodes for respective HDFS block Sent: Fri, Jun 13, 2014 9:17:50 PM Hi All I am new to Spark, working on a 3 node test cluster. I am trying to explore Spark's scope in analytics; my Spark code interacts with HDFS mostly. I am confused about how Spark chooses which node it will distribute its work to, since we assume that it can be an alternative to Hadoop MapReduce. In MapReduce we know that internally the framework will distribute code (or logic) to the nearest TaskTracker, which is co-located with the DataNode, in the same rack, or probably nearest to the data blocks. My confusion is, when I give an HDFS path inside a Spark program, how does it choose which node is nearest (if it does)? If it does not, then how will it work when I have TBs of data where high network latency will be involved? My apologies for asking a basic question, please suggest. TIA -- Anish Sneh Experience is the best teacher. http://www.anishsneh.com
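As a small illustration, the locality wait can be raised from its default when a job benefits from stricter locality; the 10-second value below is arbitrary:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("locality-demo")
      .set("spark.locality.wait", "10000")  // millis to wait for a more local slot
    val sc = new SparkContext(conf)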
Re: Spark Summit 2014 Day 2 Video Streams?
They are recorded... For example, 2013: http://spark-summit.org/2013 I'm assuming the 2014 videos will be up in 1-2 weeks. Marco On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Are these sessions recorded ?
Re: spark streaming rate limiting from kafka
Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far Kafka Dstream can read on topic-partition (via offset for example). By setting this to a small number, it will force DStream to read less data initially. Please see the post at http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E Kafka's auto.offset.reset parameter may be what you are looking for. Tobias
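A rough Scala sketch of passing that parameter through to the receiver (the ZooKeeper host, group and topic names are made up; note auto.offset.reset only takes effect when the consumer group has no committed offset yet):

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def kafkaStream(ssc: StreamingContext) = {
      // "largest" starts a fresh consumer group at the newest offsets (skipping
      // the backlog); "smallest" starts from the beginning of the log.
      val kafkaParams = Map(
        "zookeeper.connect" -> "zk-host:2181",
        "group.id"          -> "my-consumer-group",
        "auto.offset.reset" -> "largest")
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
    }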
Re: Spark 1.0: Unable to Read LZO Compressed File
I’d suggest asking the IBM Hadoop folks, but my guess is that the library cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this exception is happening in your driver program, the driver program’s java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh only applies to stuff launched on the clusters). Matei On Jul 1, 2014, at 7:15 AM, Uddin, Nasir M. nud...@dtcc.com wrote: Dear Spark Users: Spark 1.0 has been installed as Standalone, but it can’t read any compressed (CMX/Snappy) or Sequence files residing on HDFS (it can read uncompressed files from HDFS). The key notable message is: “Unable to load native-hadoop library…..”. Other related messages are: Caused by: java.lang.IllegalStateException: Cannot load com.ibm.biginsights.compress.CmxDecompressor without native library! at com.ibm.biginsights.compress.CmxDecompressor.<clinit>(CmxDecompressor.java:65) Here is the core-site.xml’s key part: <property> <name>io.compression.codecs</name> <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec</value> </property> Here is the spark-env.sh:
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=10g
export SCALA_HOME=/opt/spark/scala-2.11.1
export JAVA_HOME=/opt/spark/jdk1.7.0_55
export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2
export ADD_JARS=/opt/IHC/lib/compression.jar
export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar
export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/
export SPARK_MASTER_WEBUI_PORT=1080
export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf
Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM-branded splittable LZO-based compression codec. Any help to resolve the issue is appreciated. Thanks, Nasir DTCC DISCLAIMER: This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify us immediately and delete the email and any attachments from your system. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.
Re: Spark Summit 2014 Day 2 Video Streams?
Awesome. Just want to catch up on some sessions from other tracks. Learned a ton over the last two days. Thanks Soumya On Jul 1, 2014, at 8:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, we’re going to try to get the videos up as soon as possible. Matei
Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of that kind? This error suggests the worker is trying to bind a server on the master's IP, which clearly doesn't make sense On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I did netstat -na | grep 192.168.125.174 and it's showing 192.168.125.174:7077 LISTEN (after starting master) I tried to execute the following script from the slaves manually but it ends up with the same exception and log. This script is internally executing the java command. /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077 In this case netstat is not showing any connection established to master:7077. When we manually execute the java command, the connection is getting established to master. Thanks Regards, Meethu M On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you sure you have this ip 192.168.125.174 bound for that machine? (netstat -na | grep 192.168.125.174) Thanks Best Regards On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, I reinstalled spark, rebooted the system, but still I am not able to start the workers. It's throwing the following exception: Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 I doubt the problem is with 192.168.125.174:0. Even though the command contains master:7077, why is it showing 0 in the log? java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://master:7077 Can somebody tell me a solution? Thanks Regards, Meethu M On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, ya I tried setting another PORT also, but the same problem.. master is set in etc/hosts Thanks Regards, Meethu M On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote: that's strange, did you try setting the master port to something else (use SPARK_MASTER_PORT)? Also you said you are able to start it from the java commandline java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://:master:7077 What is the master ip specified here? Is it like you have an entry for master in the /etc/hosts? Thanks Best Regards On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi Akhil, I am running it in a LAN itself.. The IP of the master is given correctly. Thanks Regards, Meethu M On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote: why is it binding to port 0?
192.168.125.174:0 :/ Check the ip address of that master machine (ifconfig); looks like the ip address has been changed (hoping you are running these machines on a LAN) Thanks Best Regards On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi all, My Spark (Standalone mode) was running fine till yesterday. But now I am getting the following exception when I am running start-slaves.sh or start-all.sh slave3: failed to launch org.apache.spark.deploy.worker.Worker: slave3: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) slave3: at java.lang.Thread.run(Thread.java:662) The log files have the following lines.
14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread "main" org.jboss.netty.channel.ChannelException: Failed to bind to: master/192.168.125.174:0 at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272) ... Caused by: java.net.BindException: Cannot assign requested address ... I saw the same error reported before and have tried the following solutions: set the variable SPARK_LOCAL_IP, changed the SPARK_MASTER_PORT to a different number.. But nothing is working. When I try to start the worker from the respective machines using the following java command, it's running without any exception java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
Re: multiple passes in mapPartitions
Hi Zhen, The Scala iterator trait supports cloning via the duplicate method (http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator@duplicate:(Iterator[A],Iterator[A])). Regards, Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Jun 13, 2014, at 9:28 PM, zhen z...@latrobe.edu.au wrote: Thank you for your suggestion. We will try it out and see how it performs. We think the single call to mapPartitions will be faster but we could be wrong. It would be nice to have a clone method on the iterator. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
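A small Scala sketch of what that looks like inside mapPartitions; the mean-centering computation is just an example, and note that because the first iterator is fully drained before the second starts, duplicate ends up buffering the whole partition, much like collecting it into an array would:

    import org.apache.spark.rdd.RDD

    def centerByPartitionMean(data: RDD[Double]): RDD[Double] =
      data.mapPartitions { iter =>
        val (first, second) = iter.duplicate  // two independent passes over one partition
        val (sum, count) = first.foldLeft((0.0, 0L)) { case ((s, n), x) => (s + x, n + 1) }
        val mean = if (count == 0) 0.0 else sum / count
        second.map(_ - mean)                  // second pass uses the first pass's result
      }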
Re: Lost TID: Loss was due to fetch failure from BlockManagerId
It could be because you are out of memory on the worker nodes and blocks are not getting registered. An older issue with 0.6.0 was with dead nodes causing loss of tasks and then resubmission of data in an infinite loop... It was fixed in 0.7.0 though. Are you seeing a crash log in this log, or in the worker log @ 192.168.222.164 or any of the machines where the crash log is displayed? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: A lot of things can get funny when you run distributed as opposed to local -- e.g. some jar not making it over. Do you see anything of interest in the log on the executor machines -- I'm guessing 192.168.222.152/192.168.222.164.