Re: Could not compute split, block not found

2014-07-01 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input
streams?

TD
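
For reference, the storage level in question is the one passed when the input stream is created. A minimal Scala sketch, assuming a Kafka receiver (the ZooKeeper address, group and topic names are placeholders), that keeps received blocks on memory and disk instead of memory only:

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object StreamWithDiskSpill {
    def main(args: Array[String]) {
      val conf = new SparkConf().setAppName("StreamWithDiskSpill")
      val ssc = new StreamingContext(conf, Seconds(60))
      // MEMORY_AND_DISK_SER_2 lets received blocks spill to disk instead of being
      // dropped once memory fills up while earlier batches are still being processed.
      val lines = KafkaUtils.createStream(ssc, "zkhost:2181", "consumer-group",
        Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
      lines.count().print()
      ssc.start()
      ssc.awaitTermination()
    }
  }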


On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 let's say the processing time is t' and the window size t. Spark does not
 *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
 think the way Spark handles it is very nice, in particular since 1) it does
 not mix up the order in which items arrive in the stream, so items from a
 later window will always be processed later, and 2) an increase in
 data will not be punished with high load and an unresponsive system, but with
 disk space consumption instead.

 However, if all of your windows require t' > t processing time (and it's
 not because you are waiting, but because you actually do some computation),
 then you are out of luck, because if you start processing the next window
 while the previous one is still being processed, you have fewer resources for
 each and processing will take even longer. However, if you are only waiting
 (e.g., for network I/O), then maybe you can employ some asynchronous
 solution where your tasks return immediately and deliver their result via a
 callback later?
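
As a rough illustration of the asynchronous idea above, a minimal Scala sketch; callExternalService is a hypothetical blocking call standing in for the network I/O:

  import org.apache.spark.streaming.dstream.DStream
  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global

  // Hypothetical slow, I/O-bound call that a batch would otherwise block on.
  def callExternalService(line: String): Unit = Thread.sleep(10)

  // Keep the synchronous part of each batch cheap and push the waiting into a
  // Future; the result is delivered later through the onComplete callback.
  def processAsync(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      val items = rdd.collect()   // cheap synchronous work per batch
      Future { items.foreach(callExternalService) }
        .onComplete(result => println("async batch done: " + result))
    }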

 Tobias



 On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does
 Spark always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start after the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have a suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 








Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread MEETHU MATHEW
Hi,

I did netstat -na | grep 192.168.125.174 and it shows 192.168.125.174:7077
LISTEN (after starting the master).

I tried to execute the following script from the slaves manually, but it ends up
with the same exception and log. This script internally executes the java
command.
 /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
In this case netstat does not show any connection established to master:7077.

When we manually execute the java command, the connection is established
to the master.

Thanks & Regards,
Meethu M


On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Are you sure you have this ip 192.168.125.174 bind for that machine? (netstat 
-na | grep 192.168.125.174)


Thanks
Best Regards


On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi all,


I reinstalled Spark and rebooted the system, but I am still not able to start the
workers. It throws the following exception:


Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/192.168.125.174:0


I suspect the problem is with 192.168.125.174:0. Even though the command contains
master:7077, why is it showing 0 in the log?


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://master:7077


Can somebody tell me a solution?
 
Thanks & Regards,
Meethu M



On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:
 


Hi,
Yes, I tried setting another port also, but the same problem.
master is set in /etc/hosts.
 
Thanks & Regards,
Meethu M



On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


That's strange. Did you try setting the master port to something else (using
SPARK_MASTER_PORT)?


Also you said you are able to start it from the java commandline


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



What is the master IP specified here? Do you have an entry for master in
/etc/hosts?


Thanks
Best Regards


On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

Hi Akhil,


I am running it in a LAN itself..The IP of the master is given correctly.
 
Thanks & Regards,
Meethu M



On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


why is it binding to port 0? 192.168.125.174:0 :/


Check the IP address of that master machine (ifconfig); it looks like the IP
address has changed (hoping you are running these machines on a LAN).


Thanks
Best Regards


On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in 
wrote:

Hi all,


My Spark (standalone mode) was running fine till yesterday. But now I am
getting the following exception when I run start-slaves.sh or
start-all.sh:


slave3: failed to launch org.apache.spark.deploy.worker.Worker:
slave3:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
slave3:   at java.lang.Thread.run(Thread.java:662)


The log files have the following lines:


14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hduser)
14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
14/06/27 11:06:30 INFO Remoting: Starting remoting
Exception in thread main org.jboss.netty.channel.ChannelException: Failed 
to bind to: master/192.168.125.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address
...
I saw the same error reported before and have tried the following solutions.


I set the variable SPARK_LOCAL_IP and changed the SPARK_MASTER_PORT to a
different number, but nothing is working.
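
For reference, these are the kinds of entries meant here; a minimal conf/spark-env.sh sketch, where the worker address 192.168.125.175 is only a placeholder:

  export SPARK_MASTER_IP=192.168.125.174    # address the master binds to and advertises
  export SPARK_MASTER_PORT=7077
  export SPARK_LOCAL_IP=192.168.125.175     # on each worker, that machine's own address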


When I try to start the worker from the respective machines using the
following java command, it runs without any exception:


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077



Somebody please give a solution
 
Thanks & Regards,
Meethu M









Re: Serialization of objects

2014-07-01 Thread Aaron Davidson
If you want to stick with Java serialization and need to serialize a
non-Serializable object, your best choices are probably to either subclass
it with a Serializable one or wrap it in a class of your own which
implements its own writeObject/readObject methods (see here:
http://stackoverflow.com/questions/6163872/how-to-serialize-a-non-serializable-in-java
)

Otherwise you can use Kryo to register custom serializers for other
people's objects.
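
A minimal Scala sketch of the wrapper approach, where LegacyThing is just a stand-in for the non-Serializable third-party class:

  import java.io.{IOException, ObjectInputStream, ObjectOutputStream}

  // Stand-in for a third-party class that is not Serializable and cannot be changed.
  class LegacyThing(val payload: String)

  // Keep the object in a transient field and rebuild it from serializable state
  // inside writeObject/readObject, which Java serialization calls reflectively.
  class LegacyThingWrapper(@transient private var thing: LegacyThing) extends Serializable {
    def get: LegacyThing = thing

    @throws(classOf[IOException])
    private def writeObject(out: ObjectOutputStream): Unit = {
      out.defaultWriteObject()
      out.writeUTF(thing.payload)              // persist just enough state to rebuild
    }

    @throws(classOf[IOException])
    private def readObject(in: ObjectInputStream): Unit = {
      in.defaultReadObject()
      thing = new LegacyThing(in.readUTF())    // reconstruct on deserialization
    }
  }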


On Mon, Jun 30, 2014 at 1:52 PM, Sameer Tilak ssti...@live.com wrote:

 Hi everyone,
 I was able to solve this issue. For now I changed the library code and
 added the following to the class com.wcohen.ss.BasicStringWrapper:

 public class BasicStringWrapper implements  Serializable

 However, I am still curious to know how to get around the issue when you
 don't have access to the code and you are using a 3rd party jar.


 --
 From: ssti...@live.com
 To: u...@spark.incubator.apache.org
 Subject: Serialization of objects
 Date: Thu, 26 Jun 2014 09:30:31 -0700


 Hi everyone,

 Aaron, thanks for your help so far. I am trying to serialize objects that
 I instantiate from a 3rd party library, namely instances of com.wcohen.ss.Jaccard
 and com.wcohen.ss.BasicStringWrapper. However, I am having problems with
 serialization. I am (at least trying to be) using Kryo for serialization, but I
 am still facing the serialization issue. I get org.apache.spark.SparkException:
 Job aborted due to stage failure: Task not serializable:
 java.io.NotSerializableException: com.wcohen.ss.BasicStringWrapper. Any
 help with this will be great.
 Scala code:

 package approxstrmatch

 import com.wcohen.ss.BasicStringWrapper;
 import com.wcohen.ss.Jaccard;

 import java.util.Iterator;

 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkConf

 import org.apache.spark.rdd;
 import org.apache.spark.rdd.RDD;

 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.serializer.KryoRegistrator

 class MyRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
 kryo.register(classOf[approxstrmatch.JaccardScore])
 kryo.register(classOf[com.wcohen.ss.BasicStringWrapper])
 kryo.register(classOf[com.wcohen.ss.Jaccard])

   }
 }

 class JaccardScore {

   val mjc = new Jaccard() with Serializable
   val conf = new SparkConf().setMaster("spark://pzxnvm2018:7077").setAppName("ApproxStrMatch")
   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   conf.set("spark.kryo.registrator", "approxstrmatch.MyRegistrator")

   val sc = new SparkContext(conf)

   def calculateScoreSecond(sourcerdd: RDD[String], destrdd: RDD[String]) {
     val jc_ = this.mjc

     var i: Int = 0
     for (sentence <- sourcerdd.toLocalIterator) {
       val str1 = new BasicStringWrapper(sentence)
       var scorevector = destrdd.map(x => jc_.score(str1, new BasicStringWrapper(x)))
       val fileName = new String("/apps/software/scala-approsstrmatch-sentence" + i)
       scorevector.saveAsTextFile(fileName)
       i += 1
     }
   }
 }

 Here is the script:
  val distFile = sc.textFile("hdfs://serverip:54310/data/dummy/sample.txt");
  val srcFile = sc.textFile("hdfs://serverip:54310/data/dummy/test.txt");
  val score = new approxstrmatch.JaccardScore()
  score.calculateScoreSecond(srcFile, distFile)

 O/P:

 14/06/25 12:32:05 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at
 textFile at console:12), which has no missing parents
 14/06/25 12:32:05 INFO DAGScheduler: Submitting 1 missing tasks from Stage
 0 (MappedRDD[3] at textFile at console:12)
 14/06/25 12:32:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
 14/06/25 12:32:05 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
 executor localhost: localhost (PROCESS_LOCAL)
 14/06/25 12:32:05 INFO TaskSetManager: Serialized task 0.0:0 as 1879 bytes
 in 4 ms
 14/06/25 12:32:05 INFO Executor: Running task ID 0
 14/06/25 12:32:05 INFO Executor: Fetching
 http://serverip:47417/jars/approxstrmatch.jar with timestamp 1403724701564
 14/06/25 12:32:05 INFO Utils: Fetching
 http://serverip:47417/jars/approxstrmatch.jar to
 /tmp/fetchFileTemp8194323811657370518.tmp
 14/06/25 12:32:05 INFO Executor: Adding
 file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/approxstrmatch.jar to
 class loader
 14/06/25 12:32:05 INFO Executor: Fetching
 http://serverip:47417/jars/secondstring-20140618.jar with timestamp
 1403724701562
 14/06/25 12:32:05 INFO Utils: Fetching
 http://serverip:47417/jars/secondstring-20140618.jar to
 /tmp/fetchFileTemp8711755318201511766.tmp
 14/06/25 12:32:06 INFO Executor: Adding
 file:/tmp/spark-397828b5-3e0e-4bb4-b56b-58895eb4d6df/secondstring-20140618.jar
 to class loader
 14/06/25 12:32:06 INFO BlockManager: Found block broadcast_1 locally
 14/06/25 12:32:06 INFO HadoopRDD: Input split:
 hdfs://serverip:54310/data/dummy/test.txt:0+140
 14/06/25 12:32:06 INFO Executor: Serialized size of result for 0 is 717
 

build spark assign version number myself?

2014-07-01 Thread majian
Hi,all:
 I'm compiling Spark by executing './make-distribution.sh --hadoop
0.20.205.0 --tgz'.
 After the compilation completes, I find that the default version
number is 1.1.0-SNAPSHOT, i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz.
 Does anyone know how to assign a version number myself, for example
spark-1.1.0-company-bin-0.20.205.tgz?


Thanks,
majian


issue with running example code

2014-07-01 Thread Gurvinder Singh
Hi,

I am having an issue running the Scala example code. I have tested and am able
to run the Python example code successfully, but when I run the Scala code I
get this error:

java.lang.ClassCastException: cannot assign instance of
org.apache.spark.examples.SparkPi$$anonfun$1 to field
org.apache.spark.rdd.MappedRDD.f of type scala.Function1 in instance of
org.apache.spark.rdd.MappedRDD

I have compiled Spark directly from GitHub and am running it with the
command

spark-submit /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar
--class org.apache.spark.examples.SparkPi 5 --jars
/usr/share/spark/lib/spark-assembly-1.1.0-SNAPSHOT-hadoop2.h5.0.1.jar
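
For comparison, the documented spark-submit form places --class and the other options before the application jar, with the application's own arguments at the end. A minimal sketch reusing the jar path above (not a claim that this is the cause of the error):

  spark-submit --class org.apache.spark.examples.SparkPi \
    /usr/share/spark/lib/spark-examples_2.10-1.1.0-SNAPSHOT.jar 5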

Any suggestions will be helpful.

Thanks,
Gurvinder


Questions about disk IOs

2014-07-01 Thread Charles Li
Hi Spark,

I am running LBFGS on our user data. The data size with Kryo serialisation is 
about 210G. The weight size is around 1,300,000. I am quite confused that the 
performance is very close whether the data is cached or not.

The program is simple:
points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached
gradient = new LogisticGradient();
updater = new SquaredL2Updater();
initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
convergeTol, maxIter, regParam, initWeight);

I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its cluster 
mode. Below are some arguments I am using:
—executor-memory 10G
—num-executors 50
—executor-cores 2

Storage Using:
When caching:
Cached Partitions 951
Fraction Cached 100%
Size in Memory 215.7GB
Size in Tachyon 0.0B
Size on Disk 1029.7MB

The time cost by every aggregate is around 5 minutes with cache enabled. Lots 
of disk IOs can be seen on the hadoop node. I have the same result with cache 
disabled.

Should data points caching improve the performance? Should caching decrease the 
disk IO?

Thanks in advance.

Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
You can specify a custom name with the --name option. It will still contain
1.1.0-SNAPSHOT, but at least you can specify your company name.

If you want to replace SNAPSHOT with your company name, you will have to
edit make-distribution.sh and replace the following line:

VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v INFO | tail -n 1)

with something like

COMPANYNAME=SoullessMegaCorp
VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME')

and so on for other packages that use their own version scheme.




On Tue, Jul 1, 2014 at 9:21 AM, majian maj...@nq.com wrote:

   Hi,all:

  I'm working to compile spark by executing './make-distribution.sh --hadoop
 0.20.205.0 --tgz  ',
  after the completion of the compilation I found  that the default version 
 number is 1.1.0-SNAPSHOT
  i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz,
  who know how to assign version number myself ,
 for example spark-1.1.0-company-bin-0.20.205.tgz  .


 Thanks,
 majian




Re: build spark assign version number myself?

2014-07-01 Thread Guillaume Ballet
Sorry, there's a typo in my previous post, the line should read:
VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null | grep
-v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME/g')


On Tue, Jul 1, 2014 at 10:35 AM, Guillaume Ballet gbal...@gmail.com wrote:

 You can specify a custom name with the --name option. It will still
 contain 1.1.0-SNAPSHOT, but at least you can specify your company name.

 If you want to replace SNAPSHOT with your company name, you will have to
 edit make-distribution.sh and replace the following line:

 VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null |
 grep -v INFO | tail -n 1)

 with something like

 COMPANYNAME=SoullessMegaCorp
 VERSION=$(mvn help:evaluate -Dexpression=project.version 2>/dev/null |
 grep -v INFO | tail -n 1 | sed -e 's/SNAPSHOT/$COMPANYNAME')

 and so on for other packages that use their own version scheme.




 On Tue, Jul 1, 2014 at 9:21 AM, majian maj...@nq.com wrote:

   Hi,all:

  I'm working to compile spark by executing './make-distribution.sh --hadoop
 0.20.205.0 --tgz  ',
  after the completion of the compilation I found  that the default version 
 number is 1.1.0-SNAPSHOT
  i.e. spark-1.1.0-SNAPSHOT-bin-0.20.205.tgz,
  who know how to assign version number myself ,
 for example spark-1.1.0-company-bin-0.20.205.tgz  .


 Thanks,
 majian






RSpark installation on Windows

2014-07-01 Thread Stuti Awasthi
Hi All

Can we install RSpark on a Windows setup of R and use it to access a remote
Spark cluster?

Thanks
Stuti Awasthi






Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi,

The window size in Spark Streaming is time based, which means we have a
different number of elements in each window. For example, suppose you have two
streams (there might be more) which are related to each other and you want to
compare them in a specific time interval. I am not clear how that will work.
Although they start running simultaneously, they might have a different number
of elements in each time interval.

The following is the output for two streams which have the same number of
elements and ran simultaneously. The left-most value is the number of elements
in each window. If we add up the number of elements, they are the same for both
streams, but we can't compare the two streams as they differ in window size and
number of windows.

Can we somehow make windows based on real-time values for both streams? Or can
we make windows based on the number of elements?

(n, (mean, varience, SD))

Stream 1

(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))


Stream 2

(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))
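
For reference, a time-based window in the Scala API looks like the sketch below; the 10-second window length and slide interval are placeholders, and the element count per window is simply whatever arrives in that wall-clock interval:

  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.dstream.DStream

  // Both the window length and the slide interval are wall-clock durations, so two
  // streams windowed the same way can still hold different numbers of elements.
  def tenSecondWindows(stream1: DStream[Double], stream2: DStream[Double]) =
    (stream1.window(Seconds(10), Seconds(10)), stream2.window(Seconds(10), Seconds(10)))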


Regards,
Laeeq

Error: UnionPartition cannot be cast to org.apache.spark.rdd.HadoopPartition

2014-07-01 Thread Honey Joshi
Hi,
I am trying to run a project which takes data as a DStream and dumps the
data in the Shark table after various operations. I am getting the
following error :

Exception in thread main org.apache.spark.SparkException: Job aborted:
Task 0.0:0 failed 1 times (most recent failure: Exception failure:
java.lang.ClassCastException: org.apache.spark.rdd.UnionPartition cannot
be cast to org.apache.spark.rdd.HadoopPartition)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Can someone please explain the cause of this error? I am also using a
SparkContext along with the existing StreamingContext.
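
For what it's worth, a StreamingContext can be created on top of an existing SparkContext so that only one context is in play; a minimal sketch (not a claim that the two-context setup is what causes the cast error):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object SharedContextExample {
    def main(args: Array[String]) {
      // Reuse one SparkContext for both the batch RDD work and the DStream,
      // instead of creating two separate contexts in the same job.
      val sc = new SparkContext(new SparkConf().setAppName("SharedContextExample"))
      val ssc = new StreamingContext(sc, Seconds(60))
      // build DStreams from ssc and plain RDDs from sc here, then call ssc.start()
    }
  }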


Window Size

2014-07-01 Thread Laeeq Ahmed
Hi,

The window size in Spark Streaming is time based, which means we have a
different number of elements in each window. For example, suppose you have two
streams (there might be more) which are related to each other and you want to
compare them in a specific time interval. I am not clear how that will work.
Although they start running simultaneously, they might have a different number
of elements in each time interval.

The following is the output for two streams which have the same number of
elements and ran simultaneously. The left-most value is the number of elements
in each window. If we add up the number of elements, they are the same for both
streams, but we can't compare the two streams as they differ in window size and
number of windows.

Can we somehow make windows based on real-time values for both streams? Or can
we make windows based on the number of elements?

(n, (mean, varience, SD))

Stream 1

(7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
(44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
(245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
(154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
(156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
(156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
(156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))


Stream 2

(10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
(180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
(179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
(179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
(269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
(101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))


Regards,Laeeq


java.io.FileNotFoundException: http://IP/broadcast_1

2014-07-01 Thread Honey Joshi
Hi All,

We are using a Shark table to dump the data, and we are getting the following
error:

Exception in thread main org.apache.spark.SparkException: Job aborted:
Task 1.0:0 failed 1 times (most recent failure: Exception failure:
java.io.FileNotFoundException: http://IP/broadcast_1)

We don't know where the error is coming from. Can anyone please explain the
cause of this error and how to handle it? The spark.cleaner.ttl is set
to 4600, which I guess is more than enough to run the application.
Spark Version : 0.9.0-incubating
Shark : 0.9.0 - SNAPSHOT
Scala : 2.10.3

Thank You
Honey Joshi
Ideata Analytics





Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW


 Hi ,

I am using Spark standalone mode with one master and 2 slaves. I am not able to
start the workers and connect them to the master using


./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

The log says

Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/x.x.x.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address

When I try to start the worker from the slaves using the following java
command, it runs without any exception:

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077





Thanks & Regards,
Meethu M

Re: Failed to launch Worker

2014-07-01 Thread Akhil Das
Is this command working??

java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
-XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

Thanks
Best Regards


On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:


  Hi ,

 I am using Spark Standalone mode with one master and 2 slaves.I am not
  able to start the workers and connect it to the master using

 ./bin/spark-class org.apache.spark.deploy.worker.Worker
 spark://x.x.x.174:7077

 The log says

 Exception in thread main org.jboss.netty.channel.ChannelException:
 Failed to bind to: master/x.x.x.174:0
  at
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
  ...
 Caused by: java.net.BindException: Cannot assign requested address

 When I try to start the worker from the slaves using the following java
 command,its running without any exception

 java -cp
 ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
 org.apache.spark.deploy.worker.Worker spark://:master:7077




 Thanks & Regards,
 Meethu M



Re: Failed to launch Worker

2014-07-01 Thread MEETHU MATHEW
Yes.
 
Thanks & Regards,
Meethu M


On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
 


Is this command working??

java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077



Thanks
Best Regards


On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in wrote:



 Hi ,


I am using Spark Standalone mode with one master and 2 slaves.I am not  able 
to start the workers and connect it to the master using 


./bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077


The log says


Exception in thread main org.jboss.netty.channel.ChannelException: Failed to 
bind to: master/x.x.x.174:0
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
...
Caused by: java.net.BindException: Cannot assign requested address


When I try to start the worker from the slaves using the following java 
command,its running without any exception


java -cp 
::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.worker.Worker spark://:master:7077









Thanks & Regards,
Meethu M

RE: Spark 1.0 and Logistic Regression Python Example

2014-07-01 Thread Sam Jacobs
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade 
the numpy/python for a permanent fix. My current versions of python and numpy 
are 2.6 and 4.1.9 respectively.

Thanks,

Sam  

-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com] 
Sent: Tuesday, July 01, 2014 12:14 AM
To: user@spark.apache.org
Subject: Re: Spark 1.0 and Logistic Regression Python Example

You were using an old version of numpy, 1.4? I think this is fixed in the 
latest master. Try to replace vec.dot(target) by numpy.dot(vec, target), or use 
the latest master. -Xiangrui

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs sam.jac...@us.abb.com wrote:
 Hi,


 I modified the example code for logistic regression to compute the 
 error in classification. Please see below. However the code is failing 
 when it makes a call to:


 labelsAndPreds.filter(lambda (v, p): v != p).count()


 with the error message (something related to numpy or dot product):


 File 
 /opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py,
 line 65, in predict

 margin = _dot(x, self._coeff) + self._intercept

   File /opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py, 
 line 443, in _dot

 return vec.dot(target)

 AttributeError: 'numpy.ndarray' object has no attribute 'dot'


 FYI, I am running the code using spark-submit i.e.


 ./bin/spark-submit 
 examples/src/main/python/mllib/logistic_regression2.py



 The code is posted below if it will be useful in any way:


 from math import exp

 import sys
 import time

 from pyspark import SparkContext

 from pyspark.mllib.classification import LogisticRegressionWithSGD
 from pyspark.mllib.regression import LabeledPoint
 from numpy import array


 # Load and parse the data
 def parsePoint(line):
 values = [float(x) for x in line.split(',')]
 if values[0] == -1:   # Convert -1 labels to 0 for MLlib
 values[0] = 0
 return LabeledPoint(values[0], values[1:])

 sc = SparkContext(appName="PythonLR")
 # start timing
 start = time.time()
 #start = time.clock()

 data = sc.textFile("sWAMSpark_train.csv")
 parsedData = data.map(parsePoint)

 # Build the model
 model = LogisticRegressionWithSGD.train(parsedData)

 #load test data

 testdata = sc.textFile("sWSpark_test.csv")
 parsedTestData = testdata.map(parsePoint)

 # Evaluating the model on test data
 labelsAndPreds = parsedTestData.map(lambda p: (p.label,
 model.predict(p.features)))
 trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
 float(parsedData.count())
 print("Training Error = " + str(trainErr))
 end = time.time()
 print("Time is = " + str(end - start))









Re: Changing log level of spark

2014-07-01 Thread Philip Limbeck
We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
/root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
cluster. There is some DEBUG output visible in both master and worker but
nothing really interesting regarding stages or scheduling. Since we
expected a little more than that, there could be 2 possibilities:
  a) There is still some other unknown way to set the loglevel to debug
  b) There is not that much log output to be expected in this direction; I
looked for logDebug (the log wrapper in Spark) on GitHub with 84 results,
which makes me doubt that there is much more to expect.

We actually just want to have a little more insight into the system
behavior especially when using Shark since we ran into some serious
concurrency issues with blocking queries. So much for the background why
this is important to us.


On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com wrote:

 If you're using the spark-ec2 scripts, you may have to change
 /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
 is added to the classpath before Spark's own conf.


 On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 I have a log4j.xml in src/main/resources with

 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
 <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
 [...]
   <root>
     <priority value="warn" />
     <appender-ref ref="Console" />
   </root>
 </log4j:configuration>

 and that is included in the jar I package with `sbt assembly`. That
 works fine for me, at least on the driver.

 Tobias

 On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com
 wrote:
  Hi!
 
  According to
 
 https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging
 ,
  changing log-level is just a matter of creating a log4j.properties
 (which is
  in the classpath of spark) and changing log level there for the root
 logger.
  I did this steps on every node in the cluster (master and worker nodes).
  However, after restart there is still no debug output as desired, but
 only
  the default info log level.





difference between worker and slave nodes

2014-07-01 Thread aminn_524

 
Can anyone explain to me what the difference is between a worker and a slave? I
have one master and two slaves which are connected to each other. Using the jps
command I can see a Master process on the master node and a Worker process on the
slave nodes, but I don't have any worker on my master when using this command:
/bin/spark-class org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077
I thought the slaves would be working as workers.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/difference-between-worker-and-slave-nodes-tp8578.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Changing log level of spark

2014-07-01 Thread Surendranauth Hiraman
One thing we ran into was that there was another log4j.properties earlier
in the classpath. For us, it was in our MapR/Hadoop conf.

If that is the case, something like the following could help you track it
down. The only thing to watch out for is that you might have to walk up the
classloader hierarchy.

ClassLoader cl = Thread.currentThread().getContextClassLoader();
URL loc = cl.getResource("/log4j.properties");
System.out.println(loc);

-Suren




On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck philiplimb...@gmail.com
wrote:

 We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
 /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
 cluster. There is some DEBUG output visible in both master and worker but
 nothing really interesting regarding stages or scheduling. Since we
 expected a little more than that, there could be 2 possibilites:
   a) There is still some other unknown way to set the loglevel to debug
   b) There is not that much log output to be expected in this direction, I
 looked for logDebug (The log wrapper in spark) in github with 84 results,
 which means that I doubt that there is not much else to expect.

 We actually just want to have a little more insight into the system
 behavior especially when using Shark since we ran into some serious
 concurrency issues with blocking queries. So much for the background why
 this is important to us.


 On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com
 wrote:

 If you're using the spark-ec2 scripts, you may have to change
 /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
 is added to the classpath before Spark's own conf.


 On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 I have a log4j.xml in src/main/resources with

 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
 <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
 [...]
   <root>
     <priority value="warn" />
     <appender-ref ref="Console" />
   </root>
 </log4j:configuration>

 and that is included in the jar I package with `sbt assembly`. That
 works fine for me, at least on the driver.

 Tobias

 On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com
 wrote:
  Hi!
 
  According to
 
 https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging
 ,
  changing log-level is just a matter of creating a log4j.properties
 (which is
  in the classpath of spark) and changing log level there for the root
 logger.
  I did this steps on every node in the cluster (master and worker
 nodes).
  However, after restart there is still no debug output as desired, but
 only
  the default info log level.






-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@v suren.hira...@sociocast.comelos.io
W: www.velos.io


Re: Changing log level of spark

2014-07-01 Thread Yana Kadiyska
Are you looking at the driver log? (e.g. Shark?). I see a ton of
information in the INFO category on what query is being started, what
stage is starting and which executor stuff is sent to. So I'm not sure
if you're saying you see all that and you need more, or that you're
not seeing this type of information. I cannot speak to the ec2 setup,
just pointing out that under 0.9.1 I see quite a bit of scheduling
information in the driver log.

On Tue, Jul 1, 2014 at 9:20 AM, Philip Limbeck philiplimb...@gmail.com wrote:
 We changed the loglevel to DEBUG by replacing every INFO with DEBUG in
 /root/ephemeral-hdfs/conf/log4j.properties and propagating it to the
 cluster. There is some DEBUG output visible in both master and worker but
 nothing really interesting regarding stages or scheduling. Since we expected
 a little more than that, there could be 2 possibilites:
   a) There is still some other unknown way to set the loglevel to debug
   b) There is not that much log output to be expected in this direction, I
 looked for logDebug (The log wrapper in spark) in github with 84 results,
 which means that I doubt that there is not much else to expect.

 We actually just want to have a little more insight into the system behavior
 especially when using Shark since we ran into some serious concurrency
 issues with blocking queries. So much for the background why this is
 important to us.


 On Thu, Jun 26, 2014 at 3:30 AM, Aaron Davidson ilike...@gmail.com wrote:

 If you're using the spark-ec2 scripts, you may have to change
 /root/ephemeral-hdfs/conf/log4j.properties or something like that, as that
 is added to the classpath before Spark's own conf.


 On Wed, Jun 25, 2014 at 6:10 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 I have a log4j.xml in src/main/resources with

 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
 <log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
 [...]
   <root>
     <priority value="warn" />
     <appender-ref ref="Console" />
   </root>
 </log4j:configuration>

 and that is included in the jar I package with `sbt assembly`. That
 works fine for me, at least on the driver.

 Tobias

 On Wed, Jun 25, 2014 at 2:25 PM, Philip Limbeck philiplimb...@gmail.com
 wrote:
  Hi!
 
  According to
 
  https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging,
  changing log-level is just a matter of creating a log4j.properties
  (which is
  in the classpath of spark) and changing log level there for the root
  logger.
  I did this steps on every node in the cluster (master and worker
  nodes).
  However, after restart there is still no debug output as desired, but
  only
  the default info log level.





Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Uddin, Nasir M.
Dear Spark Users:

Spark 1.0 has been installed as standalone, but it can't read any compressed
(CMX/Snappy) or Sequence files residing on HDFS (it can read uncompressed files
from HDFS). The key notable message is "Unable to load native-hadoop
library". Other related messages are:

Caused by: java.lang.IllegalStateException: Cannot load 
com.ibm.biginsights.compress.CmxDecompressor without native library! at 
com.ibm.biginsights.compress.CmxDecompressor.clinit(CmxDecompressor.java:65)

Here is the core-site.xml's key part:
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec</value>
  </property>

Here is the spark.env.sh:
export SPARK_WORKER_CORES=4
export SPARK_WORKER_MEMORY=10g
export SCALA_HOME=/opt/spark/scala-2.11.1
export JAVA_HOME=/opt/spark/jdk1.7.0_55
export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2
export ADD_JARS=/opt/IHC/lib/compression.jar
export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar
export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/
export SPARK_MASTER_WEBUI_PORT=1080
export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf

Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM branded 
splittable LZO based compression codec.

Any help to resolve the issue is appreciated.

Thanks,
Nasir


Re: Spark Streaming question batch size

2014-07-01 Thread Yana Kadiyska
Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define real time values for both
streams? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- first one got 7462 points in the first batch
interval, whereas stream2 saw 10493

On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
 Hi,

 The window size in a spark streaming is time based which means we have
 different number of elements in each window. For example if you have two
 streams (might be more) which are related to each other and you want to
 compare them in a specific time interval. I am not clear how it will work.
 Although they start running simultaneously, they might have different number
 of elements in each time interval.

 The following is output for two streams which have same number of elements
 and ran simultaneously. The left most value is the number of elements in
 each window. If we add the number of elements them, they are same for both
 streams but we can't compare both streams as they are different in window
 size and number of windows.

 Can we somehow make windows based on real time values for both streams? or
 Can we make windows based on number of elements?

 (n, (mean, varience, SD))
 Stream 1
 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
 (44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
 (245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
 (154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
 (156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
 (156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
 (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))

 Stream 2
 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
 (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
 (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
 (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
 (269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
 (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))

 Regards,
 Laeeq




Re: Question about VD and ED

2014-07-01 Thread Baoxu Shi(Dash)
Hi Bin,

VD and ED are ClassTags; you could treat them as placeholders, roughly like a
template parameter T in C++ (not a 100% exact analogy).

You do not need to convert Graph[String, Double] to Graph[VD, ED].

Checking ClassTag's definition in Scala could help.
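
A minimal Scala sketch of what this means in practice (describe is a hypothetical helper):

  import scala.reflect.ClassTag
  import org.apache.spark.graphx.Graph

  // Written the way GraphX itself is: VD and ED are just type parameters with
  // ClassTag context bounds, so a concrete Graph[String, Double] can be passed
  // in directly, with no conversion.
  def describe[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): String =
    graph.vertices.count() + " vertices, " + graph.edges.count() + " edges"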

Best,

On Jul 1, 2014, at 4:49 AM, Bin WU bw...@connect.ust.hk wrote:

 Hi all,
 
 I am a newbie to graphx.
 
 I am currently having troubles understanding the types VD and ED. I 
 notice that VD and ED are widely used in graphx implementation, but I 
 don't know why and how I am supposed to use them.
 
 Specifically, say I have constructed a graph graph : Graph[String, Double], 
 when, why, and how should I transform it into the type Graph[VD, ED]?
 
 Also, I don't know what package should I import. I have imported:
 
 
 import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD
 
 But the two types VD and ED are still not found.
 
 Sorry for the stupid question.
 
 Thanks in advance!
 
 Best,
 Ben



Re: SparkKMeans.scala from examples will show: NoClassDefFoundError: breeze/linalg/Vector

2014-07-01 Thread Xiangrui Meng
You can use either bin/run-example or bin/spark-submit to run the example
code. scalac -d classes/ SparkKMeans.scala doesn't pick up the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui

On Tue, Jul 1, 2014 at 4:39 AM, Wanda Hawk wanda_haw...@yahoo.com wrote:
 Hello,

 I have installed spark-1.0.0 with scala2.10.3. I have built spark with
 sbt/sbt assembly and added
 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 to my CLASSPATH variable.
 Then I went here
 ../spark-1.0.0/examples/src/main/scala/org/apache/spark/examples created a
 new directory classes and compiled SparkKMeans.scala with scalac -d
 classes/ SparkKMeans.scala
 Then I navigated to classes (I commented this line in the scala file :
 package org.apache.spark.examples ) and tried to run it with java -cp .
 SparkKMeans and I get the following error:
 Exception in thread main java.lang.NoClassDefFoundError:
 breeze/linalg/Vector
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
 at java.lang.Class.getMethod0(Class.java:2774)
 at java.lang.Class.getMethod(Class.java:1663)
 at
 sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
 at
 sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: breeze.linalg.Vector
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 6 more
 
 The jar under
 /home/wanda/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
 contains the breeze/linalg/Vector* path. I even tried to unpack it and put
 it in the CLASSPATH, but it does not seem to be picked up.


 I am currently running java 1.8
 java version 1.8.0_05
 Java(TM) SE Runtime Environment (build 1.8.0_05-b13)
 Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)

 What am I doing wrong?



Re: Questions about disk IOs

2014-07-01 Thread Xiangrui Meng
Try to reduce the number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.

PR: https://github.com/apache/spark/pull/1110

-Xiangrui
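
A minimal Scala sketch of that suggestion (the helper name and the 100-partition figure, from the 50 executors x 2 cores above, are only for illustration):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // Bring the partition count down to roughly the number of cores in use, so each
  // aggregate ships fewer, larger partitions.
  def matchPartitionsToCores[T](points: RDD[T], cores: Int = 100): RDD[T] =
    points.coalesce(cores).persist(StorageLevel.MEMORY_AND_DISK_SER)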

On Tue, Jul 1, 2014 at 12:55 AM, Charles Li littlee1...@gmail.com wrote:
 Hi Spark,

 I am running LBFGS on our user data. The data size with Kryo serialisation is 
 about 210G. The weight size is around 1,300,000. I am quite confused that the 
 performance is very close whether the data is cached or not.

 The program is simple:
 points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
 points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached
 gradient = new LogisticGradient();
 updater = new SquaredL2Updater();
 initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
 result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
 convergeTol, maxIter, regParam, initWeight);

 I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its 
 cluster mode. Below are some arguments I am using:
 —executor-memory 10G
 —num-executors 50
 —executor-cores 2

 Storage Using:
 When caching:
 Cached Partitions 951
 Fraction Cached 100%
 Size in Memory 215.7GB
 Size in Tachyon 0.0B
 Size on Disk 1029.7MB

 The time cost by every aggregate is around 5 minutes with cache enabled. Lots 
 of disk IOs can be seen on the hadoop node. I have the same result with cache 
 disabled.

 Should data points caching improve the performance? Should caching decrease 
 the disk IO?

 Thanks in advance.


Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Aditya Varun Chadha
I attended yesterday on ustream.tv, but can't find the links to today's
streams anywhere. help!

-- 
Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)


Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Alexis Roos
General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
Track A: http://www.ustream.tv/channel/track-a1
Track B: http://www.ustream.tv/channel/track-b1
Track C: http://www.ustream.tv/channel/track-c1


On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com
wrote:

 I attended yesterday on ustream.tv, but can't find the links to today's
 streams anywhere. help!

 --
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)



Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tobias,

Your explanation makes a lot of sense. Actually, I tried to use partial
data on the same program yesterday. It has been up for around 24 hours and
is still running correctly. Thanks!

Bill


On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Bill,

 let's say the processing time is t' and the window size t. Spark does not
 *require* t' < t. In fact, for *temporary* peaks in your streaming data, I
 think the way Spark handles it is very nice, in particular since 1) it does
 not mix up the order in which items arrive in the stream, so items from a
 later window will always be processed later, and 2) an increase in
 data will not be punished with high load and an unresponsive system, but with
 disk space consumption instead.

 However, if all of your windows require t' > t processing time (and it's
 not because you are waiting, but because you actually do some computation),
 then you are out of luck, because if you start processing the next window
 while the previous one is still being processed, you have fewer resources for
 each and processing will take even longer. However, if you are only waiting
 (e.g., for network I/O), then maybe you can employ some asynchronous
 solution where your tasks return immediately and deliver their result via a
 callback later?

 Tobias



 On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Your suggestion is very helpful. I will definitely investigate it.

 Just curious. Suppose the batch size is t seconds. In practice, does
 Spark always require the program to finish processing the data of t seconds
 within t seconds' processing time? Can Spark begin to consume the new batch
 before finishing processing the next batch? If Spark can do them together,
 it may save the processing time and solve the problem of data piling up.

 Thanks!

 Bill




 On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 ​​If your batch size is one minute and it takes more than one minute to
 process, then I guess that's what causes your problem. The processing of
 the second batch will not start after the processing of the first is
 finished, which leads to more and more data being stored and waiting for
 processing; check the attached graph for a visualization of what I think
 may happen.

 Can you maybe do something hacky like throwing away a part of the data
 so that processing time gets below one minute, then check whether you still
 get that error?

 Tobias


 ​​


 On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Tobias,

 Thanks for your help. I think in my case, the batch size is 1 minute.
 However, it takes my program more than 1 minute to process 1 minute's
 data. I am not sure whether it is because the unprocessed data pile
 up. Do you have a suggestion on how to check it and solve it? Thanks!

 Bill


 On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer t...@preferred.jp
 wrote:

 Bill,

 were you able to process all information in time, or did maybe some
 unprocessed data pile up? I think when I saw this once, the reason
 seemed to be that I had received more data than would fit in memory,
 while waiting for processing, so old data was deleted. When it was
 time to process that data, it didn't exist any more. Is that a
 possible reason in your case?

 Tobias

 On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay bill.jaypeter...@gmail.com
 wrote:
  Hi,
 
  I am running a spark streaming job with 1 minute as the batch size.
 It ran
  around 84 minutes and was killed because of the exception with the
 following
  information:
 
  java.lang.Exception: Could not compute split, block
 input-0-1403893740400
  not found
 
 
  Before it was killed, it was able to correctly generate output for
 each
  batch.
 
  Any help on this will be greatly appreciated.
 
  Bill
 








Re: Could not compute split, block not found

2014-07-01 Thread Bill Jay
Hi Tathagata,

Yes. The input stream is from Kafka, and my program reads the data, keeps
all the data in memory, processes the data, and generates the output.

Bill


On Mon, Jun 30, 2014 at 11:45 PM, Tathagata Das tathagata.das1...@gmail.com
 wrote:

 Are you by any chance using only memory in the storage level of the input
 streams?

 TD



spark streaming rate limiting from kafka

2014-07-01 Thread Chen Song
In my use case, if I need to stop Spark Streaming for a while, data would
accumulate a lot on the Kafka topic-partitions. After I restart the Spark Streaming
job, the worker's heap will go out of memory on the fetch of the 1st batch.

I am wondering:

* Is there a way to throttle reading from Kafka in Spark Streaming jobs?
* Is there a way to control how far the Kafka DStream can read on a
topic-partition (via offsets, for example)? Setting this to a small
number would force the DStream to read less data initially.
* Is there a way to limit the consumption rate on the Kafka side? (This one is
not actually about Spark Streaming and doesn't seem to be a question for this
group, but I am raising it here anyway.)

I have looked at the code example below, but it doesn't seem to be supported.

KafkaUtils.createStream ...
Thanks, All
-- 
Chen Song


Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
This all seems pretty hackish and a lot of trouble to get around
limitations in mllib.
The big limitation is that right now, the optimization algorithms work on
one large dataset at a time. We need a second set of methods to work on
a large number of medium-sized datasets.
I've started to code a new set of optimization methods to add into mllib.
I've started with GroupedGradientDescent (
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala
)

GroupedGradientDescent is based on GradientDescent, but it takes
RDD[(Int, (Double, Vector))] as its data input rather than RDD[(Double,
Vector)]. The Int serves as a key to mark which elements should be grouped
together. This lets you multiplex several dataset optimizations into the
same RDD.
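
For illustration, a minimal sketch (the tiny datasets and names are made up) of
building one keyed RDD of that shape out of several labeled datasets:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  val sc = new SparkContext(new SparkConf().setAppName("grouped-sketch"))

  // tag each labeled dataset with the Int key of the problem it belongs to
  def keyed(id: Int, data: Seq[LabeledPoint]) =
    sc.parallelize(data).map(p => (id, (p.label, p.features)))

  val dsA = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
                LabeledPoint(0.0, Vectors.dense(2.0, 1.0)))
  val dsB = Seq(LabeledPoint(1.0, Vectors.dense(9.0, 0.1)))

  // one RDD[(Int, (Double, Vector))] multiplexing both optimization problems
  val grouped = keyed(0, dsA) union keyed(1, dsB)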

I think I've gotten the GroupedGradientDescent to work correctly. I need to
go up the stack and start adding methods like SVMWithSGD.trainGroup.

Does anybody have any thoughts on this?

Kyle



On Fri, Jun 27, 2014 at 6:36 PM, Xiangrui Meng men...@gmail.com wrote:

 The RDD is cached in only one or two workers. All other executors need
 to fetch its content via network. Since the dataset is not huge, could
 you try this?

 val features: Array[Vector] = ...
 val featuresBc = sc.broadcast(features)
 // parallel loops
 val labels: Array[Double] = ...
 val rdd = sc.parallelize(0 until 1, 1).flatMap(i =>
   featuresBc.value.view.zip(labels))
 val model = SVMWithSGD.train(rdd)
 models(i) = model

 Using BT broadcast factory would improve the performance of broadcasting.

 Best,
 Xiangrui

 On Fri, Jun 27, 2014 at 3:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:
  1) I'm using the static SVMWithSGD.train, with no options.
  2) I have about 20,000 features (~5000 samples) that are being attached
 and
  trained against 14,000 different sets of labels (ie I'll be doing 14,000
  different training runs against the same sets of features trying to
 figure
  out which labels can be learned), and I would also like to do cross fold
  validation.
 
  The driver doesn't seem to be using too much memory. I left it as -Xmx8g
 and
  it never complained.
 
  Kyle
 
 
 
  On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Kyle,
 
  A few questions:
 
  1) Did you use `setIntercept(true)`?
  2) How many features?
 
  I'm a little worried about driver's load because the final aggregation
  and weights update happen on the driver. Did you check driver's memory
  usage as well?
 
  Best,
  Xiangrui
 
  On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott kellr...@soe.ucsc.edu
  wrote:
   As far as I can tell there is no data to broadcast (unless there
 is
   something internal to mllib that needs to be broadcast) I've coalesced
   the
   input RDDs to keep the number of partitions limited. When running,
 I've
   tried to get up to 500 concurrent stages, and I've coalesced the RDDs
   down
   to 2 partitions, so about 1000 tasks.
   Despite having over 500 threads in the threadpool working on mllib
   tasks,
   the total CPU usage never really goes above 150%.
   I've tried increasing 'spark.akka.threads' but that doesn't seem to do
   anything.
  
   My one thought is that, because I'm using MLUtils.kFold to generate the
   RDDs, I have so many tasks working off RDDs that are permutations of the
   original RDDs, and maybe that is creating some sort of dependency
   bottleneck.
  
   Kyle
  
  
   On Thu, Jun 26, 2014 at 6:35 PM, Aaron Davidson ilike...@gmail.com
   wrote:
  
   I don't have specific solutions for you, but the general things to
 try
   are:
  
   - Decrease task size by broadcasting any non-trivial objects.
   - Increase duration of tasks by making them less fine-grained.
  
   How many tasks are you sending? I've seen in the past something like
 25
   seconds for ~10k total medium-sized tasks.
  
  
   On Thu, Jun 26, 2014 at 12:06 PM, Kyle Ellrott 
 kellr...@soe.ucsc.edu
   wrote:
  
   I'm working to set up a calculation that involves calling mllib's
   SVMWithSGD.train several thousand times on different permutations of
   the
    data. I'm trying to run the separate jobs using a threadpool to
    dispatch the different requests to a spark context connected to a
    Mesos cluster, using coarse-grained scheduling, and a max of 2000
    cores on Spark 1.0.
    Total utilization of the system is terrible. Most of the 'aggregate at
    GradientDescent.scala:178' stages (where mllib spends most of its time)
    take about 3 seconds, but have ~25 seconds of scheduler delay time.
   What kind of things can I do to improve this?
  
   Kyle
  
  
  
 
 



Re: spark streaming rate limiting from kafka

2014-07-01 Thread Luis Ángel Vicente Sánchez
Maybe reducing the batch duration would help :\


2014-07-01 17:57 GMT+01:00 Chen Song chen.song...@gmail.com:





Re: Re: spark table to hive table

2014-07-01 Thread John Omernik
Michael -

Does Spark SQL support rlike and like yet? I am running into that same
error with a basic select * from table where field like '%foo%' using the
hql() function.

Thanks




On Wed, May 28, 2014 at 2:22 PM, Michael Armbrust mich...@databricks.com
wrote:

 On Tue, May 27, 2014 at 6:08 PM, JaeBoo Jung itsjb.j...@samsung.com
 wrote:

  I already tried HiveContext as well as SqlContext.

 But it seems that Spark's HiveContext is not completely the same as Apache
 Hive.

 For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST
 LIMIT 10' works fine in Apache Hive,

 Spark SQL doesn't support window functions yet (SPARK-1442
 https://issues.apache.org/jira/browse/SPARK-1442).  Sorry for the
 non-obvious error message!



Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded ?


On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:







 General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
 Track A: http://www.ustream.tv/channel/track-a1
 Track B: http://www.ustream.tv/channel/track-b1
 Track C: http://www.ustream.tv/channel/track-c1


 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com
 wrote:

 I attended yesterday on ustream.tv, but can't find the links to today's
 streams anywhere. help!

 --
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)





Re: Failed to launch Worker

2014-07-01 Thread Aaron Davidson
Where are you running the spark-class version? Hopefully also on the
workers.

If you're trying to centrally start/stop all workers, you can add a
slaves file to the spark conf/ directory which is just a list of your
hosts, one per line. Then you can just use ./sbin/start-slaves.sh to
start the worker on all of your machines.

Note that this is already setup correctly if you're using the spark-ec2
scripts.


On Tue, Jul 1, 2014 at 5:53 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:

 Yes.

 Thanks & Regards,
 Meethu M


   On Tuesday, 1 July 2014 6:14 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:


  Is this command working??

 java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
 assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
 -Xmx512m org.apache.spark.deploy.worker.Worker spark://x.x.x.174:7077

 Thanks
 Best Regards


 On Tue, Jul 1, 2014 at 6:08 PM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:


  Hi ,

  I am using Spark Standalone mode with one master and 2 slaves. I am not
  able to start the workers and connect it to the master using

 ./bin/spark-class org.apache.spark.deploy.worker.Worker
 spark://x.x.x.174:7077

 The log says

  Exception in thread "main" org.jboss.netty.channel.ChannelException:
 Failed to bind to: master/x.x.x.174:0
  at
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
  ...
  Caused by: java.net.BindException: Cannot assign requested address

 When I try to start the worker from the slaves using the following java
 command,its running without any exception

 java -cp
 ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
 org.apache.spark.deploy.worker.Worker spark://:master:7077




 Thanks & Regards,
 Meethu M







Re: Spark Streaming question batch size

2014-07-01 Thread Laeeq Ahmed
Hi Yana,

Yes, that is what I am saying. I need both streams to be at the same pace. I do
have timestamps for each datapoint. There is a way suggested by Tathagata Das
in an earlier post where you have a bigger window than required and you
fetch your required data from that window based on your timestamps. I was just
wondering if there are other, cleaner ways to do it.
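
To make the idea concrete, a rough sketch of that approach (the socket source,
the "timestampMillis,value" line format and the 10-second interval are just
assumptions):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(new SparkConf().setAppName("aligned-windows"), Seconds(10))
  val points = ssc.socketTextStream("localhost", 9999).map { line =>
    val Array(ts, v) = line.split(",")
    (ts.toLong, v.toDouble)
  }
  // keep a window larger than the interval we actually want to compare ...
  val windowed = points.window(Seconds(30), Seconds(10))
  // ... and keep only datapoints whose own timestamp falls in the last 10 seconds
  val aligned = windowed.transform { (rdd, time) =>
    val end = time.milliseconds
    rdd.filter { case (ts, _) => ts >= end - 10000 && ts < end }
  }
  aligned.count().print()
  ssc.start()
  ssc.awaitTermination()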

Regards
Laeeq
 


On Tuesday, July 1, 2014 4:23 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
 


Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define real time values for both
streams? I am trying to do something similar to you, I think -- but
I'm not clear on what your notion of time is.
My reading of your example above is that the streams just pump data in
at different rates -- first one got 7462 points in the first batch
interval, whereas stream2 saw 10493


On Tue, Jul 1, 2014 at 5:34 AM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
 Hi,

 The window size in Spark Streaming is time-based, which means we have a
 different number of elements in each window. For example, suppose you have two
 streams (there might be more) which are related to each other and you want to
 compare them in a specific time interval; I am not clear how that will work.
 Although they start running simultaneously, they might have a different number
 of elements in each time interval.

 The following is the output for two streams which have the same number of elements
 and ran simultaneously. The leftmost value is the number of elements in
 each window. If we add up the numbers of elements, they are the same for both
 streams, but we can't compare the two streams as they differ in window
 size and number of windows.

 Can we somehow make windows based on real-time values for both streams? Or
 can we make windows based on the number of elements?

 (n, (mean, varience, SD))
 Stream 1
 (7462,(1.0535658165371238,4242.001306434091,65.13064798107025))
 (44826,(0.2546925855084064,5042.890184382894,71.0133099100647))
 (245466,(0.2857731601728941,5014.411691661449,70.81251084138628))
 (154852,(0.21907814309792514,3483.800160602281,59.023725404300606))
 (156345,(0.3075668844414613,7449.528181550462,86.31064929399189))
 (156603,(0.27785151491351234,5917.809892281489,76.9273026452994))
 (156047,(0.18130350363672296,4019.0232843737017,63.39576708561623))

 Stream 2
 (10493,(0.5554953964547791,1254.883548218503,35.42433553672536))
 (180649,(0.21684831234050583,1095.9634245399352,33.1053383087975))
 (179994,(0.22048869512317407,1443.0566458182718,37.98758541705792))
 (179455,(0.20473330254938552,1623.9538730448216,40.29831104456888))
 (269817,(0.16987953223480945,3270.663944782799,57.18971887308766))
 (101193,(0.21469292497504766,1263.0879032808723,35.53994799209577))

 Regards,
 Laeeq



why is toBreeze private everywhere in mllib?

2014-07-01 Thread Koert Kuipers
it's kind of handy to be able to convert stuff to breeze... is there some
other way I am supposed to access that functionality?


[ANNOUNCE] Flambo - A Clojure DSL for Apache Spark

2014-07-01 Thread Soren Macbeth
Yieldbot is pleased to announce the release of Flambo, our Clojure DSL for
Apache Spark.

Flambo allows one to write Spark applications in pure Clojure, as an
alternative to the Scala, Java and Python APIs currently available in Spark.

We have already written a substantial amount of internal code in Clojure
using Flambo and we are excited to hear and see what others will come up
with.

As ever, Pull Requests and/or Issues on GitHub are greatly appreciated!

You can find links to source, api docs and literate source code here:

http://bit.ly/V8FmzC

-- @sorenmacbeth


Re: why is toBreeze private everywhere in mllib?

2014-07-01 Thread Xiangrui Meng
We were not ready to expose it as a public API in v1.0. Both breeze
and MLlib are in rapid development. It would be possible to expose it
as a developer API in v1.1. For now, it should be easy to define a
toBreeze method in your own project. -Xiangrui
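
For example, a minimal dense-only sketch (sparse vectors are simply densified
here) could look like:

  import breeze.linalg.{DenseVector => BDV}
  import org.apache.spark.mllib.linalg.{Vector => SparkVector}

  // good enough for small/dense vectors; wasteful for large sparse ones
  def toBreeze(v: SparkVector): BDV[Double] = BDV(v.toArray)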

On Tue, Jul 1, 2014 at 12:17 PM, Koert Kuipers ko...@tresata.com wrote:
 it's kind of handy to be able to convert stuff to breeze... is there some
 other way I am supposed to access that functionality?


Re: Spark SQL : Join throws exception

2014-07-01 Thread Yin Huai
Seems it is a bug. I have opened
https://issues.apache.org/jira/browse/SPARK-2339 to track it.

Thank you for reporting it.

Yin


On Tue, Jul 1, 2014 at 12:06 PM, Subacini B subac...@gmail.com wrote:

 Hi All,

 Running this join query
  sql("SELECT * FROM A_TABLE A JOIN B_TABLE B WHERE
 A.status=1").collect().foreach(println)

 throws

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 1.0:3 failed 4 times, most recent failure:
 Exception failure in TID 12 on host X.X.X.X: 
 *org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
 No function to evaluate expression. type: UnresolvedAttribute, tree:
 'A.status*

 org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.eval(unresolved.scala:59)

 org.apache.spark.sql.catalyst.expressions.Equals.eval(predicates.scala:147)

 org.apache.spark.sql.catalyst.expressions.And.eval(predicates.scala:100)

 org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)

 org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$1.apply(basicOperators.scala:52)
 scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:137)

 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$1.apply(Aggregate.scala:134)
 org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
 org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)

 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)

 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 java.lang.Thread.run(Thread.java:695)
 Driver stacktrace:

 Can someone help me.

 Thanks in advance.




spark-submit script and spark.files.userClassPathFirst

2014-07-01 Thread _soumya_
Hi,
 I'm trying to get rid of an error (NoSuchMethodError) while using Amazon's
s3 client on Spark. I'm using the Spark Submit script to run my code.
Reading about my options and other threads, it seemed the most logical way
would be to make sure my jar is loaded first. Spark submit on debug shows the
same: 

spark.files.userClassPathFirst -> true

However, I can't seem to get rid of the error at runtime. The issue I
believe is happening because Spark has the older httpclient library in its
classpath. I'm using:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.3</version>
</dependency>

Any clues what might be happening?

Stack trace below:


Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
at org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:138)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:112)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:101)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:87)
at org.apache.http.impl.conn.PoolingClientConnectionManager.<init>(PoolingClientConnectionManager.java:95)
at com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:158)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:119)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:103)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:357)
at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:339)
at com.evocalize.rickshaw.commons.s3.S3Util.getS3Client(S3Util.java:35)
at com.evocalize.rickshaw.commons.s3.S3Util.putFileFromLocalFS(S3Util.java:40)
at com.evocalize.rickshaw.spark.actions.PackageFilesFunction.call(PackageFilesFunction.java:48)
at com.evocalize.rickshaw.spark.applications.GenerateSEOContent.main(GenerateSEOContent.java:76)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-script-and-spark-files-userClassPathFirst-tp8604.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


slf4j multiple bindings

2014-07-01 Thread Bill Jay
Hi all,

I have an issue with multiple slf4j bindings. My program was running
correctly. I just added a new dependency, kryo, and when I submitted a
job, it was killed because of the following error message:

*SLF4J: Class path contains multiple SLF4J bindings.*


The log said there were three slf4j bindings:
spark-assembly-0.9.1-hadoop2.3.0.jar, hadoop lib, and my own jar file.

However, I did not explicitly add slf4j to my pom.xml file. I added
exclusions to the kryo dependency but it did not work. Does anyone have
an idea how to fix this issue? Thanks!

Regards,

Bill


Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mohammed Guller
I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3 
worker). Our app is fetching data from Cassandra and doing a basic filter, map, 
and countByKey on that data. I have run into a strange problem. Even if the 
number of rows in Cassandra is just 1M, the Spark job seems to go into an 
infinite loop and runs for hours. With a small amount of data (less than 100 
rows), the job does finish, but takes almost 30-40 seconds and we frequently 
see the messages shown below. If we run the same application on a single node 
Spark (--master local[4]), then we don't see these warnings and the task 
finishes in less than 6-7 seconds. Any idea what could be the cause for these 
problems when we run our application on a standalone 4-node spark cluster?

14/06/30 19:30:16 WARN TaskSetManager: Lost TID 25036 (task 6.0:90)
14/06/30 19:30:16 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:18 WARN TaskSetManager: Lost TID 25310 (task 6.1:0)
14/06/30 19:30:18 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:19 WARN TaskSetManager: Lost TID 25582 (task 6.2:0)
14/06/30 19:30:19 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:21 WARN TaskSetManager: Lost TID 25882 (task 6.3:34)
14/06/30 19:30:21 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:22 WARN TaskSetManager: Lost TID 26152 (task 6.4:0)
14/06/30 19:30:22 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(0, 192.168.222.142, 39342, 0)
14/06/30 19:30:23 WARN TaskSetManager: Lost TID 26427 (task 6.5:4)
14/06/30 19:30:23 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:25 WARN TaskSetManager: Lost TID 26690 (task 6.6:0)
14/06/30 19:30:25 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:26 WARN TaskSetManager: Lost TID 26959 (task 6.7:0)
14/06/30 19:30:26 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:28 WARN TaskSetManager: Lost TID 27449 (task 6.8:218)
14/06/30 19:30:28 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Lost TID 27718 (task 6.9:0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:30 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:31 WARN TaskSetManager: Lost TID 27991 (task 6.10:1)
14/06/30 19:30:31 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:33 WARN TaskSetManager: Lost TID 28265 (task 6.11:0)
14/06/30 19:30:33 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:34 WARN TaskSetManager: Lost TID 28550 (task 6.12:0)
14/06/30 19:30:34 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:36 WARN TaskSetManager: Lost TID 28822 (task 6.13:0)
14/06/30 19:30:36 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:37 WARN TaskSetManager: Lost TID 29093 (task 6.14:0)
14/06/30 19:30:37 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:39 WARN TaskSetManager: Lost TID 29366 (task 6.15:0)
14/06/30 19:30:39 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:40 WARN TaskSetManager: Lost TID 29648 (task 6.16:9)
14/06/30 19:30:40 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:42 WARN TaskSetManager: Lost TID 29924 (task 6.17:0)
14/06/30 19:30:42 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:43 WARN TaskSetManager: Lost TID 30193 (task 6.18:0)
14/06/30 19:30:43 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(2, 192.168.222.164, 57185, 0)
14/06/30 19:30:45 WARN TaskSetManager: Lost TID 30559 (task 6.19:98)
14/06/30 19:30:45 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:46 WARN TaskSetManager: Lost TID 30826 (task 6.20:0)
14/06/30 19:30:46 WARN TaskSetManager: Loss was due to fetch failure from 
BlockManagerId(1, 192.168.222.152, 45896, 0)
14/06/30 19:30:48 WARN TaskSetManager: Lost TID 31098 (task 6.21:0)
14/06/30 19:30:48 WARN TaskSetManager: Loss was due to fetch failure from 

Re: multiple passes in mapPartitions

2014-07-01 Thread Chris Fregly
also, multiple calls to mapPartitions() will be pipelined by the spark
execution engine into a single stage, so the overhead is minimal.
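
as a quick illustration (the numbers are arbitrary), both calls below land in
the same stage because no shuffle separates them:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("pipelined-mapPartitions"))
  val out = sc.parallelize(1 to 100, 4)
    .mapPartitions(it => it.map(_ * 2))         // pass 1
    .mapPartitions(it => it.filter(_ % 3 == 0)) // pass 2, pipelined with pass 1
  println(out.count())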


On Fri, Jun 13, 2014 at 9:28 PM, zhen z...@latrobe.edu.au wrote:

 Thank you for your suggestion. We will try it out and see how it performs.
 We
 think the single call to mapPartitions will be faster but we could be
 wrong.
 It would be nice to have a clone method on the iterator.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Yana Kadiyska
A lot of things can get funny when you run distributed as opposed to
local -- e.g. some jar not making it over. Do you see anything of
interest in the log on the executor machines -- I'm guessing
192.168.222.152/192.168.222.164. From here
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
seems like the warning message is logged after the task fails -- but I
wonder if you might see something more useful as to why it failed to
begin with. As an example we've had cases in Hdfs where a small
example would work, but on a larger example we'd hit a bad file. But
the executor log is usually pretty explicit as to what happened...

On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller moham...@glassbeam.com wrote:
 I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3
 worker). Our app is fetching data from Cassandra and doing a basic filter,
 map, and countByKey on that data. I have run into a strange problem. Even if
 the number of rows in Cassandra is just 1M, the Spark job goes seems to go
 into an infinite loop and runs for hours. With a small amount of data (less
 than 100 rows), the job does finish, but takes almost 30-40 seconds and we
 frequently see the messages shown below. If we run the same application on a
 single node Spark (--master local[4]), then we don’t see these warnings and
 the task finishes in less than 6-7 seconds. Any idea what could be the cause
 for these problems when we run our application on a standalone 4-node spark
 cluster?




Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

2014-07-01 Thread Chris Fregly
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL)
where possible just like MapReduce.  it's a best practice to co-locate your
Spark Workers on the same nodes as your HDFS DataNodes for just this
reason.

this is achieved through the RDD.preferredLocations() interface method:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

on a related note, you can configure spark.locality.wait as the number of
millis to wait before falling back to a less-local data node (RACK_LOCAL):
  http://spark.apache.org/docs/latest/configuration.html
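
for instance, a minimal sketch (the 10-second value and the input path are
placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("locality-example")
    .set("spark.locality.wait", "10000")  // millis; the default is 3000
  val sc = new SparkContext(conf)
  // scheduling of these tasks prefers the nodes holding the HDFS blocks
  val lines = sc.textFile("hdfs:///path/to/input")
  println(lines.count())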

-chris


On Fri, Jun 13, 2014 at 11:06 PM, anishs...@yahoo.co.in 
anishs...@yahoo.co.in wrote:

 Hi All

 Is there any communication between the Spark MASTER node and the Hadoop NameNode
 while distributing work to WORKER nodes, like we have in MapReduce?

 Please suggest

 TIA

 --
 Anish Sneh
 Experience is the best teacher.
 http://in.linkedin.com/in/anishsneh


  --
 From: anishs...@yahoo.co.in
 To: u...@spark.incubator.apache.org

 Subject: How Spark Choose Worker Nodes for respective HDFS block
 Sent: Fri, Jun 13, 2014 9:17:50 PM

   Hi All

 I am new to Spark, working on a 3-node test cluster. I am trying to explore
 Spark's scope in analytics; my Spark code interacts with HDFS mostly.

 I am wondering how Spark chooses the nodes on which it will distribute
 its work.

 Since we assume that it can be an alternative to Hadoop MapReduce: in
 MapReduce we know that internally the framework will distribute code (or logic)
 to the nearest TaskTrackers, which are co-located with DataNodes, in the same
 rack, or otherwise nearest to the data blocks.

 My confusion is, when I give an HDFS path inside a Spark program, how does it
 choose which node is nearest (if it does)?

 If it does not, then how will it work when I have TBs of data, where high
 network latency will be involved?

 My apologies for asking basic question, please suggest.

 TIA
 --
 Anish Sneh
 Experience is the best teacher.
 http://www.anishsneh.com



Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Marco Shaw
They are recorded...  For example, 2013: http://spark-summit.org/2013

I'm assuming the 2014 videos will be up in 1-2 weeks.

Marco


On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 Are these sessions recorded ?


 On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:







 General Session / Keynotes: http://www.ustream.tv/channel/spark-summit-2014
 Track A: http://www.ustream.tv/channel/track-a1
 Track B: http://www.ustream.tv/channel/track-b1
 Track C: http://www.ustream.tv/channel/track-c1


 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com
 wrote:

 I attended yesterday on ustream.tv, but can't find the links to today's
 streams anywhere. help!

 --
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)






Re: spark streaming rate limiting from kafka

2014-07-01 Thread Tobias Pfeiffer
Hi,

On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote:

 * Is there a way to control how far the Kafka DStream can read on a
 topic-partition (via offsets, for example)? Setting this to a small
 number would force the DStream to read less data initially.


Please see the post at
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccaph-c_m2ppurjx-n_tehh0bvqe_6la-rvgtrf1k-lwrmme+...@mail.gmail.com%3E
Kafka's auto.offset.reset parameter may be what you are looking for.
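
A rough sketch of passing consumer properties through kafkaParams (the
ZooKeeper quorum, group id and topic are placeholders, and whether these
settings alone avoid the OOM is untested):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val ssc = new StreamingContext(new SparkConf().setAppName("kafka-throttle"), Seconds(10))
  val kafkaParams = Map(
    "zookeeper.connect" -> "zk1:2181",
    "group.id" -> "my-consumer-group",
    // start from the latest offsets when the group has no committed offset,
    // instead of replaying the whole accumulated backlog
    "auto.offset.reset" -> "largest",
    // cap the size of each fetch request
    "fetch.message.max.bytes" -> (1024 * 1024).toString)
  val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)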

Tobias


Re: Spark 1.0: Unable to Read LZO Compressed File

2014-07-01 Thread Matei Zaharia
I’d suggest asking the IBM Hadoop folks, but my guess is that the library 
cannot be found in /opt/IHC/lib/native/Linux-amd64-64/. Or maybe if this 
exception is happening in your driver program, the driver program’s 
java.library.path doesn’t include this. (SPARK_LIBRARY_PATH from spark-env.sh 
only applies to stuff launched on the clusters).

Matei

On Jul 1, 2014, at 7:15 AM, Uddin, Nasir M. nud...@dtcc.com wrote:

 Dear Spark Users:
  
 Spark 1.0 has been installed as Standalone, but it can’t read any compressed 
 (CMX/Snappy) or Sequence files residing on HDFS (it can read uncompressed 
 files from HDFS). The key notable message is: “Unable to load native-hadoop 
 library…..”. Other related messages are –
  
 Caused by: java.lang.IllegalStateException: Cannot load 
 com.ibm.biginsights.compress.CmxDecompressor without native library! at 
 com.ibm.biginsights.compress.CmxDecompressor.<clinit>(CmxDecompressor.java:65)
  
 Here is the core-site.xml’s key part:
 <property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,com.ibm.biginsights.compress.CmxCodec</value>
 </property>
  
 Here is the spark.env.sh:
 export SPARK_WORKER_CORES=4
 export SPARK_WORKER_MEMORY=10g
 export SCALA_HOME=/opt/spark/scala-2.11.1
 export JAVA_HOME=/opt/spark/jdk1.7.0_55
 export SPARK_HOME=/opt/spark/spark-0.9.1-bin-hadoop2
 export ADD_JARS=/opt/IHC/lib/compression.jar
 export SPARK_CLASSPATH=/opt/IHC/lib/compression.jar
 export SPARK_LIBRARY_PATH=/opt/IHC/lib/native/Linux-amd64-64/
 export SPARK_MASTER_WEBUI_PORT=1080
 export HADOOP_CONF_DIR=/opt/IHC/hadoop-conf
  
 Note: core-site.xml and hdfs-site.xml are in hadoop-conf. CMX is an IBM 
 branded splittable LZO based compression codec.
  
 Any help to resolve the issue is appreciated.
  
 Thanks,
 Nasir
 
 DTCC DISCLAIMER: This email and any files transmitted with it are 
 confidential and intended solely for the use of the individual or entity to 
 whom they are addressed. If you have received this email in error, please 
 notify us immediately and delete the email and any attachments from your 
 system. The recipient should check this email and any attachments for the 
 presence of viruses.  The company accepts no liability for any damage caused 
 by any virus transmitted by this email.



Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Awesome. 

Just want to catch up on some sessions from other tracks. 

Learned a ton over the last two days. 

Thanks 
Soumya 






 On Jul 1, 2014, at 8:50 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Yup, we’re going to try to get the videos up as soon as possible.
 
 Matei
 
 On Jul 1, 2014, at 7:47 PM, Marco Shaw marco.s...@gmail.com wrote:
 
 They are recorded...  For example, 2013: http://spark-summit.org/2013
 
 I'm assuming the 2014 videos will be up in 1-2 weeks.
 
 Marco
 
 
 On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com 
 wrote:
 Are these sessions recorded ? 
 
 
 On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:
 General Session / Keynotes :  
 http://www.ustream.tv/channel/spark-summit-2014
 
 Track A : http://www.ustream.tv/channel/track-a1
 
 Track B: http://www.ustream.tv/channel/track-b1
 
 Track C: http://www.ustream.tv/channel/track-c1
 
 
 On Tue, Jul 1, 2014 at 9:37 AM, Aditya Varun Chadha adic...@gmail.com 
 wrote:
 I attended yesterday on ustream.tv, but can't find the links to today's 
 streams anywhere. help!
 
 -- 
 Aditya Varun Chadha | http://www.adichad.com | +91 81308 02929 (M)
 


Re: org.jboss.netty.channel.ChannelException: Failed to bind to: master/1xx.xx..xx:0

2014-07-01 Thread Aaron Davidson
In your spark-env.sh, do you happen to set SPARK_PUBLIC_DNS or something of
that kind? This error suggests the worker is trying to bind a server on the
master's IP, which clearly doesn't make sense.



On Mon, Jun 30, 2014 at 11:59 PM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:

 Hi,

 I did netstat -na | grep 192.168.125.174 and it's showing
 192.168.125.174:7077 LISTEN (after starting the master).

 I tried to execute the following script from the slaves manually but it
 ends up with the same exception and log. This script is internally executing
 the java command.
  /usr/local/spark-1.0.0/sbin/start-slave.sh 1 spark://192.168.125.174:7077
 In this case netstat is not showing any connection established to master:7077.

 When we manually execute the java command, the connection is getting
 established to master.

 Thanks & Regards,
 Meethu M


   On Monday, 30 June 2014 6:38 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:


  Are you sure you have this ip 192.168.125.174 bound
 on that machine? (netstat -na | grep 192.168.125.174)

 Thanks
 Best Regards


 On Mon, Jun 30, 2014 at 5:34 PM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:

 Hi all,

 I reinstalled spark,reboot the system,but still I am not able to start the
 workers.Its throwing the following exception:

 Exception in thread "main" org.jboss.netty.channel.ChannelException:
 Failed to bind to: master/192.168.125.174:0

 I suspect the problem is with 192.168.125.174:0. Even though the command
 contains master:7077, why is it showing 0 in the log?

 java -cp
 ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
 org.apache.spark.deploy.worker.Worker spark://master:7077

 Can somebody tell me  a solution.

 Thanks & Regards,
 Meethu M


   On Friday, 27 June 2014 4:28 PM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:


  Hi,
 Yes, I tried setting another port too, but the same problem..
 master is set in /etc/hosts

 Thanks & Regards,
 Meethu M


   On Friday, 27 June 2014 3:23 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:


 that's strange, did you try setting the master port to something else (use
 SPARK_MASTER_PORT)?

 Also you said you are able to start it from the java command line

 java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/
 assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m
 -Xmx512m org.apache.spark.deploy.worker.Worker spark://:*master*:7077

 What is the master ip specified here? Is it like you have an entry for
 *master* in /etc/hosts?

 Thanks
 Best Regards


 On Fri, Jun 27, 2014 at 3:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:

 Hi Akhil,

 I am running it in a LAN itself..The IP of the master is given correctly.

 Thanks & Regards,
 Meethu M


   On Friday, 27 June 2014 2:51 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:


  why is it binding to port 0? 192.168.125.174:0 :/

  Check the ip address of that master machine (ifconfig); it looks like the ip
  address has been changed (hoping you are running these machines on a LAN)

 Thanks
 Best Regards


 On Fri, Jun 27, 2014 at 12:00 PM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:

 Hi all,

  My Spark (Standalone mode) was running fine till yesterday. But now I am
  getting the following exception when I run start-slaves.sh or
 start-all.sh

 slave3: failed to launch org.apache.spark.deploy.worker.Worker:
 slave3:   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 slave3:   at java.lang.Thread.run(Thread.java:662)

 The log files has the following lines.

 14/06/27 11:06:30 INFO SecurityManager: Using Spark's default log4j
 profile: org/apache/spark/log4j-defaults.properties
 14/06/27 11:06:30 INFO SecurityManager: Changing view acls to: hduser
 14/06/27 11:06:30 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hduser)
 14/06/27 11:06:30 INFO Slf4jLogger: Slf4jLogger started
 14/06/27 11:06:30 INFO Remoting: Starting remoting
 Exception in thread "main" org.jboss.netty.channel.ChannelException:
 Failed to bind to: master/192.168.125.174:0
  at
 org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
 ...
 Caused by: java.net.BindException: Cannot assign requested address
  ...
 I saw the same error reported before and have tried the following
 solutions.

  Set the variable SPARK_LOCAL_IP, changed the SPARK_MASTER_PORT to a
  different number.. But nothing is working.

 When I try to start the worker from the respective machines using the
 following java command,its running without any exception

 java -cp
 ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m 

Re: multiple passes in mapPartitions

2014-07-01 Thread Frank Austin Nothaft
Hi Zhen,

The Scala iterator trait supports cloning via the duplicate method 
(http://www.scala-lang.org/api/current/index.html#scala.collection.Iterator@duplicate:(Iterator[A],Iterator[A])).
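
A small sketch of using it for two passes over each partition inside
mapPartitions (the per-partition normalization is just an example):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("two-pass-partitions"))
  val normalized = sc.parallelize(1 to 100, 4).mapPartitions { it =>
    // duplicate buffers elements internally, so memory grows with the gap
    // between the two passes
    val (first, second) = it.duplicate
    val partitionSum = first.map(_.toDouble).sum  // pass 1
    second.map(_ / partitionSum)                  // pass 2
  }
  normalized.collect()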

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Jun 13, 2014, at 9:28 PM, zhen z...@latrobe.edu.au wrote:

 Thank you for your suggestion. We will try it out and see how it performs. We
 think the single call to mapPartitions will be faster but we could be wrong.
 It would be nice to have a clone method on the iterator.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/multiple-passes-in-mapPartitions-tp7555p7616.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Lost TID: Loss was due to fetch failure from BlockManagerId

2014-07-01 Thread Mayur Rustagi
It could be because you are out of memory on the worker nodes & blocks are
not getting registered..
An older issue with 0.6.0 was with dead nodes causing loss of tasks & then
resubmission of data in an infinite loop... It was fixed in 0.7.0 though.
Are you seeing a crash log in this log.. or in the worker log @ 192.168.222.164
or any of the machines where the crash log is displayed?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Wed, Jul 2, 2014 at 7:51 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 A lot of things can get funny when you run distributed as opposed to
 local -- e.g. some jar not making it over. Do you see anything of
 interest in the log on the executor machines -- I'm guessing
 192.168.222.152/192.168.222.164. From here

 https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
 seems like the warning message is logged after the task fails -- but I
 wonder if you might see something more useful as to why it failed to
 begin with. As an example we've had cases in Hdfs where a small
 example would work, but on a larger example we'd hit a bad file. But
 the executor log is usually pretty explicit as to what happened...

 On Tue, Jul 1, 2014 at 8:57 PM, Mohammed Guller moham...@glassbeam.com
 wrote:
  I am running Spark 1.0 on a 4-node standalone spark cluster (1 master + 3
  worker). Our app is fetching data from Cassandra and doing a basic
 filter,
  map, and countByKey on that data. I have run into a strange problem.
 Even if
  the number of rows in Cassandra is just 1M, the Spark job goes seems to
 go
  into an infinite loop and runs for hours. With a small amount of data
 (less
  than 100 rows), the job does finish, but takes almost 30-40 seconds and
 we
  frequently see the messages shown below. If we run the same application
 on a
  single node Spark (--master local[4]), then we don’t see these warnings
 and
  the task finishes in less than 6-7 seconds. Any idea what could be the
 cause
  for these problems when we run our application on a standalone 4-node
 spark
  cluster?
 
 
 