BTW I should add that one other thing that would help MLlib locally would be
doing model updates in batches. That is, instead of operating on one point at a
time, group a bunch of them together and apply a matrix operation, which will
allow more efficient use of BLAS or other linear algebra primitives.
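A rough sketch of the idea, using Breeze (which MLlib already depends on); the gradient formula here is for ordinary linear least squares and the names are illustrative, not MLlib code:

import breeze.linalg.{DenseMatrix, DenseVector}

// Each row of `batch` is one data point; `labels` holds the corresponding targets.
// One matrix-vector multiply replaces a per-point loop of dot products.
def batchGradient(batch: DenseMatrix[Double],
                  labels: DenseVector[Double],
                  weights: DenseVector[Double]): DenseVector[Double] = {
  val errors = (batch * weights) - labels   // predictions minus targets, in one BLAS call
  batch.t * errors                          // accumulate the gradient with another matrix op
}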
These numbers are from GPUs and Intel MKL (a closed-source math library for
Intel processors), where CPU-bound algorithms will run faster than with MLlib's
JBLAS. However, in theory there's nothing preventing the use of these in MLlib
(e.g. if you have a faster BLAS locally; ad
BIDMach is a CPU- and GPU-accelerated machine learning library, also from Berkeley.
https://github.com/BIDData/BIDMach/wiki/Benchmarks
They benchmarked against Spark 0.9 and claimed that it's
significantly faster than Spark MLlib. In Spark 1.0, a lot of
performance optimization had been done, a
Which deployment environment are you running the streaming programs in?
Standalone? In that case you have to specify the max cores for
each application, otherwise all the cluster resources may get consumed by
the application.
http://spark.apache.org/docs/latest/spark-standalone.html
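For example, a minimal sketch of capping an application's cores when creating the streaming context (the app name, core count and batch interval are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.cores.max", "4")   // this application takes at most 4 cores cluster-wide
val ssc = new StreamingContext(conf, Seconds(10))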
TD
On Thu, J
Responses inline.
On Wed, Jul 23, 2014 at 4:13 AM, lalit1303 wrote:
> Hi,
> Thanks TD for your reply. I am still not able to resolve the problem for my
> use case.
> I have, let's say, 1000 different RDDs, and I am applying a transformation
> function on each RDD and I want the output of all RDDs
I am bumping into this problem as well. I am trying to move to akka 2.3.x
from 2.2.x in order to port to Scala 2.11 - only akka 2.3.x is available in
Scala 2.11. All 2.2.x akka versions work fine, and all 2.3.x akka versions give the
following exception in "new SparkContext". Still investigating why...
java.util.
Yeah, maybe I should bump the issue to major. Now that I've thought about it
more after giving my previous answer, this should be easy to fix just by doing a
foreachRDD on all the input streams within the system (rather than
explicitly doing it like I asked you to do).
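A hedged sketch of that idea (not the actual fix): register a no-op output operation on every input stream so each one gets consumed:

import org.apache.spark.streaming.dstream.DStream

def touchAllInputs(inputStreams: Seq[DStream[_]]): Unit = {
  inputStreams.foreach { stream =>
    stream.foreachRDD { rdd => () }   // a do-nothing output op is enough to force consumption
  }
}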
Thanks, Alan, for testing this out and confirming.
Even in local mode, Spark serializes data that would be sent across the
network, e.g. in a reduce operation, so that you can catch errors that would
happen in distributed mode. You can make serialization much faster by using the
Kryo serializer; see http://spark.apache.org/docs/latest/tuning.htm
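A minimal sketch of the Kryo switch (the registrator class named here is a hypothetical placeholder for your own):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")  // register your classes here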
Yeah, I wrote those lines a while back; I wanted to contrast storage
levels with and without serialization. I should have realized that
StorageLevel.MEMORY_ONLY_SER could be mistaken for the default level.
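For reference, a small sketch of the two levels being contrasted (the data is illustrative; sc is the SparkContext, e.g. in the shell):

import org.apache.spark.storage.StorageLevel

val rdd1 = sc.parallelize(1 to 1000)
rdd1.persist(StorageLevel.MEMORY_ONLY)       // what cache() actually uses: deserialized objects
val rdd2 = sc.parallelize(1 to 1000)
rdd2.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized bytes: more compact, but costs CPU to read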
TD
On Wed, Jul 23, 2014 at 5:12 AM, Shao, Saisai wrote:
> Yeah, the document may not be prec
You could also set the property using plain old
System.setProperty("spark.cleaner.ttl", "3600")
before creating the StreamingContext in the spark shell
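For example, in the shell (the batch interval is illustrative):

System.setProperty("spark.cleaner.ttl", "3600")   // must happen before the StreamingContext exists
val ssc = new org.apache.spark.streaming.StreamingContext(sc, org.apache.spark.streaming.Seconds(10))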
TD
On Sat, Jul 26, 2014 at 7:50 AM, Yana Kadiyska wrote:
> Hi,
>
> I'm starting spark-shell like this:
>
> SPARK_MEM=1g SPARK_JAVA_OPTS="-Dsp
Thanks for the reply. I understand this now.
But in another situation, when I use a large heap size to avoid any spilling
(I confirm there are no spilling messages in the log), I still see a lot of
time being spent in the writeObject0() function. Can you please tell me why
there would be any serialization
These messages are actually not about spilling the RDD; they're about spilling
intermediate state in a reduceByKey, groupBy or other operation whose state
doesn't fit in memory. We have to do that in these cases to avoid running out of
memory. You can minimize spilling by having more reduce tasks.
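A sketch of that suggestion: pass a larger partition count to the shuffle operation so each reduce task holds less state (the path and the number 200 are illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // brings in the pair-RDD functions in Spark 1.x

def wordCounts(sc: SparkContext, path: String) = {
  val pairs = sc.textFile(path).flatMap(_.split("\\s+")).map(word => (word, 1))
  pairs.reduceByKey(_ + _, 200)          // 200 reduce tasks instead of the default
}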
never mind, I think it's just the GC taking its time while I have many
gigabytes of unused cached RDDs that I cannot get rid of easily
On Jul 26, 2014 4:44 PM, "Koert Kuipers" wrote:
> i have graphx queries running inside a service where i collect the results
> to the driver and do not hold any refe
i have graphx queries running inside a service where i collect the results
to the driver and do not hold any references to the rdds involved in the
queries. my assumption was that with the references gone spark would go and
remove the cached rdds from memory (note, i did not cache them, graphx did)
Hello,
I am running the SparkPageRank example, which uses the cache() API for persistence.
This, AFAIK, uses the MEMORY_ONLY storage level. But even in this setup, I see a
lot of "WARN ExternalAppendOnlyMap: Spilling in-memory map of" messages
in the log. Why is that? I thought that MEMORY_ONLY means kick
Hello,
I am executing the SparkPageRank example. It uses the "cache()" API for
persistence of RDDs. And if I am not wrong, it in turn uses the MEMORY_ONLY
storage level. However, the oprofile report shows a lot of activity in the
writeObject0 function.
There is not even a single "Spilling in-memory..."
A very simple example of adding a new operator to Spark SQL:
https://github.com/apache/spark/pull/1366
An example of adding a new type of join to Spark SQL:
https://github.com/apache/spark/pull/837
Basically, you will need to add a new physical operator that inherits from
SparkPlan and a Strategy
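A very rough structural sketch of what such an operator looks like (this would live inside the Spark SQL codebase next to the existing operators, as in the PRs above; RegionJoin is a hypothetical name and the exact Catalyst APIs differ between releases, so treat this as shape only):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.expressions.{Attribute, Row}
import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}

case class RegionJoin(left: SparkPlan, right: SparkPlan) extends BinaryNode {
  // the output schema is the concatenation of both children's attributes
  override def output: Seq[Attribute] = left.output ++ right.output

  // execute() produces the joined rows from the children's results
  override def execute(): RDD[Row] = {
    val leftRows = left.execute()
    val rightRows = right.execute()
    ???   // the region-join logic over leftRows and rightRows goes here
  }
}

// You would also add a Strategy (see SparkStrategies.scala) that pattern-matches the
// logical plan and emits RegionJoin(...) when the join condition applies.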
Hi,
I'm starting spark-shell like this:
SPARK_MEM=1g SPARK_JAVA_OPTS="-Dspark.cleaner.ttl=3600"
/spark/bin/spark-shell -c 3
but when I try to create a streaming context
val scc = new StreamingContext(sc, Seconds(10))
I get:
org.apache.spark.SparkException: Spark Streaming cannot be used
witho
>
> Hi,
>
> This is my first code in Shark 0.9.1. I am new to Spark and Shark, so I
> don't know where I went wrong. It would be really helpful if someone out
> there could troubleshoot the problem.
> First of all, I will give a glimpse of my code, which is developed in
> IntelliJ IDEA. This code i
I want to use a Spark cluster through a Scala function, so I can integrate Spark
into my program directly.
For example:
When I call a count function in my own program, my program will deploy the
function to the cluster, so I can get the result directly
def count()=
{
val master = "spark://ma
Normally any setup that has an inferior mode for the Scala REPL will also support
the Spark REPL (with little or no modification).
Apart from that, I personally use the Spark REPL "normally" by invoking
spark-shell in a shell in emacs, and I keep the Scala tags (etags) for the
Spark source loaded. With this setup it is
(For the benefit of other users)
The workaround appears to be building Spark for the exact Hadoop version
and building the app with Spark as a provided dependency and without
hadoop-client as a direct dependency of the app. With that, HDFS access
works just fine.
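In sbt terms, a sketch of that setup (the version number is illustrative):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"
// and no direct "org.apache.hadoop" % "hadoop-client" % "..." entry in the app's dependencies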
On Fri, Jul 25, 2014 at 11:50
Hello
I was wondering if you could point me to which modules I need
to update if I wanted to add extra functionality to Spark SQL.
I am thinking of implementing a region-join operator, and I guess I should add
the implementation details under joins.scala, but what else do I need to
modify?
Look at mapPartitions. Whereas map turns one value V1 into one value
V2, mapPartitions lets you turn one entire Iterator[V1] into one whole
Iterator[V2]. The function that does so can perform some
initialization at its start, then process all of the values, and
clean up at its end. This is how
Thank you, but that doesn't answer my general question.
I might need to enrich my records using different data sources (or DBs).
So the general use case I need to support is to have some kind of function
that has init() logic for creating a connection to the DB, queries the DB for each
record, and enrich
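A hedged sketch of that pattern with mapPartitions, where the record type and DB client below are hypothetical stand-ins for your own:

import org.apache.spark.rdd.RDD

case class Record(key: String, value: String)
case class Enriched(rec: Record, extra: String)

// a stub client standing in for a real DB connection
trait DbClient { def lookup(key: String): String; def close(): Unit }
object DbClient {
  def connect(url: String): DbClient = new DbClient {
    def lookup(key: String): String = "stub-" + key
    def close(): Unit = ()
  }
}

def enrich(records: RDD[Record]): RDD[Enriched] =
  records.mapPartitions { iter =>
    val db = DbClient.connect("jdbc:placeholder")                  // init once per partition
    val out = iter.map(r => Enriched(r, db.lookup(r.key))).toList  // drain before closing the connection
    db.close()                                                     // clean up at the end of the partition
    out.iterator
  }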