Re: Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread Matei Zaharia
BTW I should add that one other thing that would help MLlib locally would be doing model updates in batches. That is, instead of operating on one point at a time, group together a bunch of them and apply a matrix operation, which will allow more efficient use of BLAS or other linear algebra primitives
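A rough sketch of what batched updates mean in practice, assuming a least-squares style gradient step (the names, shapes and Breeze usage are illustrative, not MLlib code): stack a mini-batch of points into a matrix so the per-point dot products collapse into matrix-vector multiplies that BLAS can execute efficiently.

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Hypothetical mini-batch gradient step: one GEMV per batch instead of
    // many small dot products, letting BLAS do the heavy lifting.
    def miniBatchStep(
        X: DenseMatrix[Double],   // one row per example in the batch
        y: DenseVector[Double],   // labels for the batch
        w: DenseVector[Double],   // current weights
        lr: Double): DenseVector[Double] = {
      val residuals = (X * w) - y        // matrix-vector multiply (BLAS GEMV)
      val gradient  = X.t * residuals    // another GEMV
      w - (gradient * (lr / X.rows.toDouble))
    }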

Re: Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread Matei Zaharia
These numbers are from GPUs and Intel MKL (a closed-source math library for Intel processors), where for CPU-bound algorithms you are going to get faster speeds than MLlib's JBLAS. However, there's in theory nothing preventing the use of these in MLlib (e.g. if you have a faster BLAS locally; ad

Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
BIDMach is a CPU- and GPU-accelerated machine learning library, also from Berkeley. https://github.com/BIDData/BIDMach/wiki/Benchmarks They benchmarked against Spark 0.9, and they claimed that it's significantly faster than Spark MLlib. In Spark 1.0, a lot of performance optimization has been done, a

Re: streaming sequence files?

2014-07-26 Thread Tathagata Das
In which deployment environment are you running the streaming programs? Standalone? In that case you have to specify the max cores for each application; otherwise all the cluster resources may get consumed by the application. http://spark.apache.org/docs/latest/spark-standalone.html TD On Thu, J
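For reference, a minimal way to cap an application's cores on a standalone cluster (the app name and the value 4 are just examples):

    import org.apache.spark.{SparkConf, SparkContext}

    // Limit the total cores this application may claim on a standalone
    // cluster, so one job cannot consume the whole cluster.
    val conf = new SparkConf()
      .setAppName("MyStreamingApp")      // hypothetical name
      .set("spark.cores.max", "4")       // example value; size per cluster
    val sc = new SparkContext(conf)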

Re: java.lang.StackOverflowError when calling count()

2014-07-26 Thread Tathagata Das
Responses inline. On Wed, Jul 23, 2014 at 4:13 AM, lalit1303 wrote: > Hi, > Thanks TD for your reply. I am still not able to resolve the problem for my > use case. > I have, let's say, 1000 different RDDs, and I am applying a transformation > function on each RDD and I want the output of all RDDs
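One commonly suggested remedy when a StackOverflowError comes from a very long lineage (e.g. chaining transformations across many RDDs before calling count()) is to checkpoint periodically so the lineage stays bounded. A sketch under that assumption; `rdds`, `transform`, and the checkpoint directory are placeholders:

    // Truncate the lineage every 100 RDDs so the DAG never gets deep enough
    // to overflow the stack when count() walks it.
    sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

    var acc = sc.parallelize(Seq.empty[Int])
    for ((rdd, i) <- rdds.zipWithIndex) {           // rdds: the ~1000 input RDDs
      acc = acc.union(rdd.map(transform))           // transform: placeholder function
      if (i % 100 == 99) {
        acc.checkpoint()                            // cut the lineage here
        acc.count()                                 // force the checkpoint to materialize
      }
    }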

Re: SparkContext startup time out

2014-07-26 Thread Anand Avati
I am bumping into this problem as well. I am trying to move to akka 2.3.x from 2.2.x in order to port to Scala 2.11 - only akka 2.3.x is available for Scala 2.11. All akka 2.2.x versions work fine, but all akka 2.3.x versions give the following exception in "new SparkContext". Still investigating why.. java.util.

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-26 Thread Tathagata Das
Yeah, maybe I should bump the issue to major. Now that I've thought it through for my previous answer, this should be easy to fix just by doing a foreachRDD on all the input streams within the system (rather than you doing it explicitly, as I had asked). Thanks, Alan, for testing this out and confirming
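For anyone else hitting this before the fix lands, the explicit workaround referred to above amounts to attaching a dummy output action to the input stream so every batch gets computed; a minimal sketch, assuming an existing StreamingContext `ssc` (the socket source and intervals are placeholders):

    import org.apache.spark.streaming.Seconds

    // Force each batch of the input stream to be computed by registering a
    // no-op output action, so the windowed stream sees all of the data.
    val lines = ssc.socketTextStream("localhost", 9999)    // placeholder source
    lines.foreachRDD(rdd => rdd.foreach(_ => ()))           // dummy output action
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.count().print()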

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
Even in local mode, Spark serializes data that would be sent across the network, e.g. in a reduce operation, so that you can catch errors that would happen in distributed mode. You can make serialization much faster by using the Kryo serializer; see http://spark.apache.org/docs/latest/tuning.html
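A minimal example of switching to Kryo, as described in the tuning guide (the app name and registrator class are hypothetical):

    import org.apache.spark.SparkConf

    // Use Kryo instead of the default Java serialization.
    val conf = new SparkConf()
      .setAppName("MyApp")    // hypothetical
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Optionally register frequently serialized classes via a custom
    // KryoRegistrator:
    //   conf.set("spark.kryo.registrator", "com.example.MyRegistrator")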

Re: "spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-26 Thread Tathagata Das
Yeah, I wrote those lines a while back; I wanted to contrast storage levels with and without serialization. I should have realized that StorageLevel.MEMORY_ONLY_SER can be confused for being the default level. TD On Wed, Jul 23, 2014 at 5:12 AM, Shao, Saisai wrote: > Yeah, the document may not be prec
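To make the contrast concrete (rddA and rddB are placeholders; this is just the standard persist API):

    import org.apache.spark.storage.StorageLevel

    // The actual default: deserialized Java objects in memory.
    rddA.cache()                                // same as persist(StorageLevel.MEMORY_ONLY)

    // Serialized form: more compact in memory, but each access pays a
    // deserialization cost.
    rddB.persist(StorageLevel.MEMORY_ONLY_SER)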

Re: Help using streaming from Spark Shell

2014-07-26 Thread Tathagata Das
You could also set the property using plain old System.setProperty("spark.cleaner.ttl", "3600") before creating the StreamingContext in the spark shell TD On Sat, Jul 26, 2014 at 7:50 AM, Yana Kadiyska wrote: > Hi, > > I'm starting spark-shell like this: > > SPARK_MEM=1g SPARK_JAVA_OPTS="-Dsp
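Putting that together with the shell session from the original message, the sequence would look roughly like this (3600 is just the TTL value discussed in the thread):

    // In the spark-shell, before constructing the StreamingContext:
    System.setProperty("spark.cleaner.ttl", "3600")

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    val ssc = new StreamingContext(sc, Seconds(10))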

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Thanks for the reply. I understand this now. But in another situation, when I use a large heap size to avoid any spilling (I confirm there are no spilling messages in the log), I still see a lot of time being spent in the writeObject0() function. Can you please tell me why there would be any serialization

Re: "Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-26 Thread Matei Zaharia
These messages are actually not about spilling the RDD; they're about spilling intermediate state in a reduceByKey, groupBy or other operation whose state doesn't fit in memory. We have to do that in these cases to avoid running out of memory. You can minimize spilling by having more reduce tasks
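Concretely, the number of reduce tasks can be raised by passing a partition count to the shuffle operation; 200 is an arbitrary example and `pairs` is a placeholder RDD[(String, Int)]:

    // More reduce tasks mean each task holds a smaller slice of the
    // intermediate map state, making spills to disk less likely.
    val counts = pairs.reduceByKey(_ + _, 200)   // explicit partition count
    val groups = pairs.groupByKey(200)           // same idea for grouping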

Re: graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
Never mind, I think it's just the GC taking its time while I have many gigabytes of unused cached RDDs that I cannot get rid of easily. On Jul 26, 2014 4:44 PM, "Koert Kuipers" wrote: > i have graphx queries running inside a service where i collect the results > to the driver and do not hold any refe

graphx cached partitions wont go away

2014-07-26 Thread Koert Kuipers
I have GraphX queries running inside a service where I collect the results to the driver and do not hold any references to the RDDs involved in the queries. My assumption was that, with the references gone, Spark would go and remove the cached RDDs from memory (note: I did not cache them, GraphX did)
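If waiting for the reference-based cleanup is not an option, the caches GraphX creates can also be dropped explicitly once a query's results have been collected; a sketch against the GraphX API, where `graph` is whatever Graph the query built:

    // Explicitly release the vertex and edge caches created by GraphX.
    graph.unpersistVertices(blocking = false)
    graph.edges.unpersist(blocking = false)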

"Spilling in-memory..." messages in log even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Hello, I am running the SparkPageRank example, which uses the cache() API for persistence. This, AFAIK, uses the MEMORY_ONLY storage level. But even in this setup, I see a lot of "WARN ExternalAppendOnlyMap: Spilling in-memory map of" messages in the log. Why is it so? I thought that MEMORY_ONLY means kick

Lot of object serialization even with MEMORY_ONLY

2014-07-26 Thread lokesh.gidra
Hello, I am executing the SparkPageRank example. It uses the "cache()" API for persistence of RDDs, and if I am not wrong, that in turn uses the MEMORY_ONLY storage level. However, the oprofile report shows a lot of activity in the writeObject0 function. There is not even a single "Spilling in-memory..."

Re: SparkSQL extensions

2014-07-26 Thread Michael Armbrust
A very simple example of adding a new operator to Spark SQL: https://github.com/apache/spark/pull/1366 An example of adding a new type of join to Spark SQL: https://github.com/apache/spark/pull/837 Basically, you will need to add a new physical operator that inherits from SparkPlan and a Strategy
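A very rough skeleton of what that might look like; package paths and signatures differ between Spark SQL versions, and the RegionJoin operator below is entirely made up, so treat this purely as an illustration of the pieces Michael mentions:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.catalyst.expressions.{Attribute, Row}
    import org.apache.spark.sql.execution.SparkPlan

    // Hypothetical physical operator: join rows whose regions overlap.
    case class RegionJoin(left: SparkPlan, right: SparkPlan) extends SparkPlan {
      override def output: Seq[Attribute] = left.output ++ right.output
      override def children: Seq[SparkPlan] = Seq(left, right)
      override def execute(): RDD[Row] = {
        // interval-join the children's row RDDs here
        ???
      }
    }

    // A planner Strategy (see the patterns in SparkStrategies.scala) would
    // then pattern-match the logical plan and emit RegionJoin(...) for the
    // plans it recognizes.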

Help using streaming from Spark Shell

2014-07-26 Thread Yana Kadiyska
Hi, I'm starting spark-shell like this: SPARK_MEM=1g SPARK_JAVA_OPTS="-Dspark.cleaner.ttl=3600" /spark/bin/spark-shell -c 3 but when I try to create a streaming context val scc = new StreamingContext(sc, Seconds(10)) I get: org.apache.spark.SparkException: Spark Streaming cannot be used without

Exception : org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table

2014-07-26 Thread Bilna Govind
> > Hi, > > This is my first code in Shark 0.9.1. I am new to Spark and Shark, so I > don't know where I went wrong. It will be really helpful if someone out > there can troubleshoot the problem. > First of all I will give a glimpse of my code, which is developed in > IntelliJ IDEA. This code i

How can I integrate spark cluster into my own program without using spark-submit?

2014-07-26 Thread Lizhengbing (bing, BIPA)
I want to use a Spark cluster through a Scala function, so I can integrate Spark into my program directly. For example: when I call a count function in my own program, my program will deploy the function to the cluster, so I can get the result directly. def count() = { val master = "spark://ma
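In principle this can be done by constructing the SparkContext directly against the master URL and listing the application's jar so the executors can load the code; a hedged sketch where the master URL, jar path and input path are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Build a SparkContext inside the application itself, no spark-submit.
    def count(): Long = {
      val conf = new SparkConf()
        .setMaster("spark://master-host:7077")   // placeholder master URL
        .setAppName("EmbeddedCount")
        .setJars(Seq("/path/to/my-app.jar"))     // so executors can load this code
      val sc = new SparkContext(conf)
      try {
        sc.textFile("hdfs:///path/to/input").count()   // placeholder input
      } finally {
        sc.stop()
      }
    }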

Re: Emacs Setup Anyone?

2014-07-26 Thread Prashant Sharma
Normally any setup that has an inferior mode for the Scala REPL will also support the Spark REPL (with little or no modification). Apart from that, I personally use the Spark REPL "normally" by invoking spark-shell in a shell inside Emacs, and I keep the Scala tags (etags) for Spark loaded. With this setup it is

Re: Hadoop client protocol mismatch with spark 1.0.1, cdh3u5

2014-07-26 Thread Bharath Ravi Kumar
(For the benefit of other users) The workaround appears to be building Spark for the exact Hadoop version and building the app with Spark as a provided dependency, without hadoop-client as a direct dependency of the app. With that, HDFS access works just fine. On Fri, Jul 25, 2014 at 11:50
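In sbt terms the arrangement described above would look something like this (the version string is illustrative):

    // build.sbt sketch: Spark is "provided" (supplied by the cluster at
    // runtime) and hadoop-client is NOT declared directly, so the Hadoop
    // version is whatever Spark itself was built against.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"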

SparkSQL extensions

2014-07-26 Thread Christos Kozanitis
Hello, I was wondering, is it easy for you guys to point me to the modules I need to update if I want to add extra functionality to Spark SQL? I am thinking of implementing a region-join operator and I guess I should add the implementation details under joins.scala, but what else do I need to modify?

Re: Spark Function setup and cleanup

2014-07-26 Thread Sean Owen
Look at mapPartitions. Whereas map turns one value V1 into one value V2, mapPartitions lets you turn one entire Iterator[V1] into one whole Iterator[V2]. The function that does so can perform some initialization at its start, then process all of the values, and clean up at its end. This is how
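A sketch of the pattern, flavored for the DB-enrichment use case from this thread; the JDBC URL, `records`, and `enrich` are placeholders, and note that with a lazy iterator the cleanup has to happen only after the last element is consumed (materializing the partition first is the simplest way, if memory allows):

    // One connection per partition: open before iterating, close after the
    // partition's records have been processed.
    val enriched = records.mapPartitions { iter =>
      val conn = java.sql.DriverManager.getConnection("jdbc:...")  // placeholder URL
      val result = iter.map(rec => enrich(rec, conn)).toList       // materialize, then clean up
      conn.close()
      result.iterator
    }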

Re: Spark Function setup and cleanup

2014-07-26 Thread Yosi Botzer
Thank you, but that doesn't answer my general question. I might need to enrich my records using different data sources (or DBs). So the general use case I need to support is to have some kind of Function that has init() logic for creating a connection to the DB, queries the DB for each record, and enrich