You can try Spark on Mesos or YARN, since they have a lot more support for
scheduling.
Thanks
Best Regards
On Thu, Sep 25, 2014 at 4:50 AM, Subacini B subac...@gmail.com wrote:
hi All,
How do I run multiple requests concurrently on the same cluster?
I have a program using Spark Streaming
Looks like PySpark was not able to find the Python binaries in the
environment. You need to install Python
(https://docs.python.org/2/faq/windows.html) if it is not installed already.
Thanks
Best Regards
On Thu, Sep 25, 2014 at 9:00 AM, Denny Lee denny.g@gmail.com wrote:
This seems similar to
Hello again spark users and developers!
I have a standalone Spark cluster (1.1.0) with Spark SQL running on it. My
cluster consists of 4 datanodes and the replication factor of the files is 3.
I use the Thrift server to access Spark SQL and have 1 table with 30+
partitions. When I run a query on the whole table
I encountered the same issue when I went through the tutorial's first
standalone application. I tried to reinstall sbt but it doesn't help.
Then I followed this thread, created a workspace under the Spark directory and
executed ./sbt/sbt package; it reports packaging successfully. But how this
A 7000x7000 matrix is not a tall-and-skinny matrix. Storing the dense matrix
requires 784MB. The driver needs more storage for collecting results
from executors as well as for making a copy for LAPACK's dgesvd, so you
need more memory. Do you need the full SVD? If not, try a small
k, e.g., 50. -Xiangrui
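For reference, a minimal sketch of a truncated SVD with a small k in MLlib (here rows is an assumed RDD[Vector] holding the matrix; it is not from the original thread):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

// rows: RDD[Vector] is assumed to hold the 7000x7000 matrix row by row
val mat = new RowMatrix(rows)
// compute only the top 50 singular values/vectors instead of the full SVD,
// which keeps the result collected back to the driver small
val svd = mat.computeSVD(50, computeU = true)
val s = svd.s  // singular values
val U = svd.U  // left singular vectors (a distributed RowMatrix)
val V = svd.V  // right singular vectors (a local Matrix)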
On
For the vectorizer, what's the output feature dimension and are you
creating sparse vectors or dense vectors? The model on the driver
consists of numClasses * numFeatures doubles. However, the driver
needs more memory in order to receive the task result (of the same
size) from executors. So you
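As an aside, a minimal sketch of building sparse feature vectors with MLlib; the dimension and indices below are made-up values, not from the thread:

import org.apache.spark.mllib.linalg.Vectors

// a sparse vector of dimension 100000 with only three non-zero entries;
// far cheaper to ship and store than a dense vector of the same size
val v = Vectors.sparse(100000, Seq((0, 1.0), (31, 2.5), (99999, 0.5)))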
Hi Raghuveer,
This might be a better question for the cdh-user list or the Hadoop user
list. The Hadoop web interfaces for both the NameNode and ResourceManager
are enabled by default. Is it possible you have a firewall blocking those
ports?
-Sandy
On Wed, Sep 24, 2014 at 9:00 PM, Raghuveer
I encountered exactly the same problem. How did you solve this?
Thanks
Hi
We've just experienced an issue with the new Spark-1.1.0 and the start-thriftserver.sh script.
We tried to launch start-thriftserver.sh with the --master yarn option and got the following error message:
Failed to load Hive Thrift server main class
There are two problems you may be facing:
1. Your application is taking all the resources.
2. Inside your application, task submission is not scheduling properly.
For 1, you can either configure your app to take fewer resources (see the
sketch below) or use a Mesos/YARN-type scheduler to dynamically change or
juggle resources.
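A minimal sketch for option 1 on a standalone cluster (the values are illustrative only, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-streaming-app")      // placeholder name
  .set("spark.cores.max", "4")         // cap the total cores this app grabs
  .set("spark.executor.memory", "2g")  // cap executor memory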
Hi, I found the problem.
By default, gmond is monitoring the multicast ip:239.2.11.71, while I set
*.sink.ganglia.host=localhost.
The correct configuration in metrics.properties:
# Enable GangliaSink for all instances
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
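For illustration, a complete Ganglia sink section matching a multicast gmond might look like the following (the port, mode and period values are assumed defaults, not taken from the original message):

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=239.2.11.71
*.sink.ganglia.port=8649
*.sink.ganglia.mode=multicast
*.sink.ganglia.period=10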
ENV:
Spark:0.9.0-incubating
Hadoop:2.3.0
I run a Spark task on YARN. I see this log in the NodeManager:
2014-09-25 17:43:34,141 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 549 for container-id
ENV:
Spark:0.9.0-incubating
Hadoop:2.3.0
I run a Spark task on YARN. I see this log in the NodeManager:
2014-09-25 17:43:34,141 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 549 for container-id
My yarn-site.xml config:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
I had a similar problem writing to Cassandra using the Cassandra connector.
I am not sure whether this will work for you, but I reduced the number of
cores to 1 per machine and my job was stable. More explanation of my
issue...
I work with Spark on an unstable cluster with bad administration.
I started getting
14/09/25 15:29:56 ERROR storage.DiskBlockObjectWriter: Uncaught
exception while reverting partial writes to file
We build some Spark jobs with external jars. I compile the jobs by including
them in one assembly,
but we are looking for an approach to put all the external jars into HDFS.
We have already put the Spark jar in an HDFS folder and set up the
SPARK_JAR variable.
What is the best way to do that for the other external jars?
You should check the ResourceManager log when you submit the job to YARN.
It records how many resources your Spark application actually asked the
ResourceManager for, for each container.
Did you use the fair scheduler?
There is a config parameter of the fair scheduler
I updated the Spark version from 1.0.2 to 1.1.0 and experienced a Snappy
version issue with the new Spark 1.1.0. After updating the glibc version,
another issue occurred. I have abstracted the log as follows:
14/09/25 11:29:18 WARN [org.apache.hadoop.util.NativeCodeLoader---main]:
Unable to load
This is a question on using the Pregel function in GraphX. Does a message
get serialized and then de-serialized in the scenario where both the source
and the destination vertices are in the same compute node/machine?
Thank you!
I am experiencing Spark Streaming restart issues similar to what is discussed
in the two threads below (in which I failed to find a solution). Could anybody
let me know if anything is wrong in the way I start/stop, or if this could be
a Spark bug?
Sorry for missing your original email - thanks for the catch, eh?!
On Thu, Sep 25, 2014 at 7:14 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
Fixed the issue by downgrading Hive from 13.1 to 12.0; it works well now.
Regards
On 31 Aug, 2014, at 7:28 am,
SparkContext.addJar()?
Why didn't you like the fat-jar way?
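If you go the addJar route, a minimal sketch (the HDFS path is a placeholder):

// make an extra dependency available to the executors at runtime
sc.addJar("hdfs:///libs/your-external-dep.jar")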
2014-09-25 16:25 GMT+04:00 rzykov rzy...@gmail.com:
We build some SPARK jobs with external jars. I compile jobs by including
them
in one assembly.
But look for an approach to put all external jars into HDFS.
We have already put
Hi all,
I tried to run my Spark job on YARN.
In my application, I need to call third-party JNI libraries from the Spark job.
However, I can't find a way to make the Spark job load my native libraries.
Is there anyone who knows how to solve this problem?
Thanks.
Ziv Huang
Hmmm, you might be suffering from SPARK-1719.
Not sure what the proper workaround is, but it sounds like your native
libs are not in any of the standard lib directories; one workaround
might be to copy them there, or add their location to /etc/ld.so.conf
(I'm assuming Linux).
On Thu, Sep 25,
Thanks, Yanbo and Nicholas. Now it makes more sense — query optimization is the
answer. /Du
From: Nicholas Chammas
nicholas.cham...@gmail.commailto:nicholas.cham...@gmail.com
Date: Thursday, September 25, 2014 at 6:43 AM
To: Yanbo Liang yanboha...@gmail.commailto:yanboha...@gmail.com
Cc: Du Li
I suppose I have other problems as I can’t get the Scala example to work
either. Puzzling, as I have literally coded like the examples (that are
purported to work), but no luck.
mn
On Sep 24, 2014, at 11:27 AM, Tim Smith secs...@gmail.com wrote:
Maybe differences between JavaPairDStream and
Hi all
VertexRDD is partitioned with HashPartitioner, and it exhibits some
imbalance of tasks.
For example, Connected Components with partition strategy Edge2D:
Aggregated Metrics by Executor
Executor ID Task Time Total Tasks Failed Tasks Succeeded Tasks
Input Shuffle Read
Hi,
Is anybody using LZOP files with Spark?
We have a huge volume of LZOP files in HDFS to process through Spark. The
MapReduce framework automatically detects the file format and sends the
decompressed version to the mappers.
Is there any such support in Spark?
As of now I am manually downloading,
Hi Liquan,
Thanks. I was running this in spark-shell. I was able to resolve this issue by
creating an app and then submitting it via spark-submit in yarn-client mode. I
have seen this happening before as well -- submitting via spark-shell has
memory issues. The same code then works fine when
Hello,
A bit of background.
I have a dataset with about 200 million records and around 10 columns. The size
of this dataset is around 1.5Tb and is split into around 600 files.
When I read this dataset, using sparkContext, by default it creates around 3000
partitions if I do not specify the
At 2014-09-25 06:52:46 -0700, Cheuk Lam chl...@hotmail.com wrote:
This is a question on using the Pregel function in GraphX. Does a message
get serialized and then de-serialized in the scenario where both the source
and the destination vertices are in the same compute node/machine?
Yes,
Then I think it's time for you to look at the Spark Master logs...
On Thu, Sep 25, 2014 at 7:51 AM, danilopds danilob...@gmail.com wrote:
Hi Marcelo,
Yes, I can ping spark-01 and I also included the IP and host in my
/etc/hosts file.
My VM can ping the local machine too.
Hi Team,
Can I use actors in Spark Streaming based on event type? Could you please
review the test program below and let me know if there is anything I need to
change with respect to best practices.
import akka.actor.Actor
import akka.actor.{ActorRef, Props}
import org.apache.spark.SparkConf
import
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote:
I am running spark with the default settings in yarn client mode. For some
reason yarn always allocates three containers to the application (wondering
where it is set?), and only uses two of them.
The default number of
Hi Sameer,
When starting spark-shell, by default the JVM for spark-shell only has
512MB of memory. As a quick hack, you can use
SPARK_MEM=4g bin/spark-shell to set the JVM memory to 4g. For more
information, you can refer to
http://spark.apache.org/docs/latest/cluster-overview.html
Thanks,
Liquan
On
You can pass the HDFS location of those extra jars in the spark-submit
--jars argument. Spark will take care of using Yarn's distributed
cache to make them available to the executors. Note that you may need
to provide the full hdfs URL (not just the path, since that will be
interpreted as a local
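For illustration, a submit command along those lines might look like this (host, paths and class names are placeholders):

spark-submit --master yarn-cluster \
  --class com.example.MyJob \
  --jars hdfs://namenode:8020/libs/dep1.jar,hdfs://namenode:8020/libs/dep2.jar \
  my-job-assembly.jar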
Hi Davies,
Thanks for your help.
I ultimately re-wrote the code to use broadcast variables, and then
received an error when trying to broadcast self.all_models that the size
did not fit in an int (recall that broadcasts use 32 bit ints to store
size), suggesting that it was in fact over 2G. I
I followed the linked JIRAs to HDFS-7005, which is in Hadoop 2.6.0.
Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away?
On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Looks like it's a HDFS issue, pretty new.
Tim,
I think I understand this now. I had a five node Spark cluster and a five
partition topic, and I created five receivers. I found this:
http://stackoverflow.com/questions/25785581/custom-receiver-stalls-worker-in-spark-streaming
Indicating that if I use all my workers as receivers,
Please add the Apache Spark Maryland meetup to the Spark website.
http://www.meetup.com/Apache-Spark-Maryland
Thanks!
Brian
*Brian Husted*
*Tetra Concepts, LLC*
tetraconcepts.com
*301.518.6994 (c)*
*866.618.1343 (f)*
Hi Deep,
I believe that you are referring to map on Iterable[String].
Suppose you have
iter: Iterable[String]
you can do
val newIter = iter.map(item => item + "a")
which will create a new Iterable[String] by appending an
"a" to each string in iter.
Does this answer your question?
Additionally,
If I dial up/down the number of executor cores, this does what I want. Thanks
for the extra eyes!
mn
On Sep 25, 2014, at 12:34 PM, Matt Narrell matt.narr...@gmail.com wrote:
Tim,
I think I understand this now. I had a five node Spark cluster and a five
partition topic,
What is the size of your vector? Mine is set to 20. I am seeing slow results
as well with iterations=5 and 200,000,000 elements.
Hi,
The details laid out in the Spark UI for a job in progress are really
interesting and very useful, but they vanish once the job is done.
Is there a way to get the job details after the job has finished?
I am looking for the Spark UI data, not the standard input, output and error
info.
Thanks,
Harsha
Hi everyone,
I need some advice about how to do the following: given an RDD of vectors
(each vector being Vector(Int, Int, Int, Int)), I need to group the data,
then apply a function to every group, comparing each consecutive
item within a group and retaining a variable (that has to
Hi. We are testing Spark Streaming. It looks awesome!
We are trying to figure out how to submit a new version of a live-forever job.
We have a job that streams metrics from a bunch of servers, applies
transformations like .reduceByWindow, and then stores the results in HDFS.
If we submit this new
I posted yesterday about a related issue but resolved it shortly after. I'm
using Spark Streaming to summarize event data from Kafka and save it to a
MySQL table. Currently the bottleneck is in writing to MySQL and I'm puzzled
as to how to speed it up. I've tried repartitioning with several
We're running into an error (below) when trying to read spilled shuffle
data back in.
Has anybody encountered this before / is anybody familiar with what causes
these Kryo UnsupportedOperationExceptions?
any guidance appreciated,
Sandy
---
com.esotericsoftware.kryo.KryoException
Thank you.
Where is the number of containers set?
On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote:
I am running spark with the default settings in yarn client mode. For some
reason yarn always
From spark-submit --help:
YARN-only:
--executor-cores NUM    Number of cores per executor (Default: 1).
--queue QUEUE_NAME      The YARN queue to submit to (Default: default).
--num-executors NUM     Number of executors to launch (Default: 2).
--archives ARCHIVES
On Thu, Sep 25, 2014 at 11:25 AM, Brad Miller
bmill...@eecs.berkeley.edu wrote:
Hi Davies,
Thanks for your help.
I ultimately re-wrote the code to use broadcast variables, and then received
an error when trying to broadcast self.all_models that the size did not fit
in an int (recall that
Hi Xiangrui,
After setting k for the SVD to a smaller value (200), it's working.
Thanks,
Shailesh
Update for posterity: once again I solved the problem shortly after
posting to the mailing list. updateStateByKey uses the default
partitioner, which in my case seemed to be set to a single partition.
Changing my call from .updateStateByKey[Long](updateFn) to
.updateStateByKey[Long](updateFn,
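A sketch of what the call with an explicit partitioner can look like (the stream name and partition count are placeholders, not the original code):

import org.apache.spark.HashPartitioner

// spread the state RDD over more than one partition
val state = keyedStream.updateStateByKey[Long](updateFn, new HashPartitioner(16))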
I would guess the field serializer is having issues reconstructing the
class again; it's pretty much best effort.
Is this an intermediate type?
On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
We're running into an error (below) when trying to read spilled
Thanks for the update. I'm interested in writing the results to MySQL as
well; can you shed some light or share a code sample on how you set up the
driver/connection pool/etc.?
On Thu, Sep 25, 2014 at 4:00 PM, maddenpj madde...@gmail.com wrote:
Update for posterity, so once again I solved the problem
Hi,
I am using Spark 1.1.0 on a cluster. My job takes as input 30 files in a
directory (I am using sc.textFile("dir/*")) to read in the files. I am
getting the following warning:
WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99,
mesos12-dev.sccps.net): java.io.FileNotFoundException:
Hi
I am running into trouble using the IPython notebook on my cluster. I use the
following command to set the cluster up:
$ ./spark-ec2 --key-pair=$KEY_PAIR --identity-file=$KEY_FILE
--region=$REGION --slaves=$NUM_SLAVES launch $CLUSTER_NAME
On master I launch python as follows
$
Hi again!
At the moment I am trying to use Parquet, and I want to keep the data in
memory in an efficient way so that requests against the data are as fast as
possible.
I read that Parquet is able to encode nested columns; it uses the Dremel
encoding with definition and repetition levels.
Is
Please also check the load balance of the RDD on YARN. How many
partitions are you using? Does it match the number of CPU cores?
-Xiangrui
On Thu, Sep 25, 2014 at 12:28 PM, bhusted brian.hus...@gmail.com wrote:
What is the size of your vector mine is set to 20? I am seeing slow results
as well
hi all,
I am getting this strange error about half way through the job (running
spark 1.1 on yarn client mode):
14/09/26 00:54:06 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4d0155fb
java.nio.channels.CancelledKeyException
at
thanks.
On Thu, Sep 25, 2014 at 10:25 PM, Marcelo Vanzin [via Apache Spark
User List] ml-node+s1001560n15177...@n3.nabble.com wrote:
From spark-submit --help:
YARN-only:
--executor-cores NUM    Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to
Good to know it worked out, and thanks for the update. I didn't realize
you need to provision for receiver workers + processing workers. One
would think a worker would process multiple stages of an app/job and that
receiving is just a stage of the job.
On Thu, Sep 25, 2014 at 12:05 PM, Matt Narrell
Thanks Yi Tian!
Yes, I use the fair scheduler.
In the ResourceManager log I see the container's launch shell:
/home/export/Data/hadoop/tmp/nm-local-dir/usercache/hpc/appcache/application_1411693809133_0002/container_1411693809133_0002_01_02/launch_container.sh
In the end:
exec /bin/bash -c
Hi all
My code is as follows:
/usr/local/webserver/sparkhive/bin/spark-submit
--class org.apache.spark.examples.streaming.FlumeEventCount
--master yarn
--deploy-mode cluster
--queue online
--num-executors 5
--driver-memory 6g
--executor-memory 20g
--executor-cores 5
You're right, I'm suffering from SPARK-1719.
I've tried adding their location to /etc/ld.so.conf and I've submitted my
job in yarn-client mode,
but the problem is the same: my native libraries are not loaded.
Does this method work in your case?
Hi SK,
For the problem with lots of shuffle files and the "too many open files"
exception, there are a couple of options:
1. The Linux kernel has a limit on the number of open files at once. This
is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or
/etc/sysctl.d/. Try increasing
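For illustration, checking and raising the limit in the current shell before starting a worker (the number is only an example; the permanent setting belongs in the files mentioned above):

ulimit -n        # show the current open-file limit
ulimit -n 65535  # raise it for this shell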
Matt you should be able to set an HDFS path so you'll get logs written to a
unified place instead of to local disk on a random box on the cluster.
On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell matt.narr...@gmail.com
wrote:
How does this work with a cluster manager like YARN?
mn
On Sep 25,
Hi Vinay,
What I'm guessing is happening is that Spark is taking the locality of
files into account and you don't have node-local data on all your
machines. This might be the case if you're reading out of HDFS and your
600 files are somehow skewed to only be on about 200 of your 400 machines.
A
Hi Harsha,
I use LZOP files extensively on my Spark cluster -- see my writeup for how
to do this on this mailing list post:
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAOoZ679ehwvT1g8=qHd2n11Z4EXOBJkP+q=Aj0qE_=shhyl...@mail.gmail.com%3E
Maybe we should better document how
Hi Christy,
I'm more of a Gradle fan but I know SBT fits better into the Scala
ecosystem as a build tool. If you'd like to give Gradle a shot try this
skeleton Gradle+Spark repo from my coworker Punya.
https://github.com/punya/spark-gradle-test-example
Good luck!
Andrew
On Thu, Sep 25, 2014
Hi Alexey,
You should see in the logs a locality measure like NODE_LOCAL,
PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node
on them and you're reading out of HDFS, then you should be seeing almost
all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark
Yup it's all in the gist:
https://gist.github.com/maddenpj/5032c76aeb330371a6e6
Lines 6-9 deal with setting up the driver specifically. This sets the driver
up on each partition, which keeps the connection around for all the records in
the partition rather than opening one per record.
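Not the actual gist contents, but a minimal sketch of the per-partition connection pattern it describes (the DStream name, table, columns and JDBC URL are placeholders):

import java.sql.DriverManager

summaries.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // one connection per partition instead of one per record
    val conn = DriverManager.getConnection("jdbc:mysql://db-host/metrics", "user", "pass")
    val stmt = conn.prepareStatement("INSERT INTO counts (k, cnt) VALUES (?, ?)")
    records.foreach { case (k, cnt) =>
      stmt.setString(1, k)
      stmt.setLong(2, cnt)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}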
Maybe you have Python 2.7 on the master but Python 2.6 in the cluster.
You should upgrade Python to 2.7 in the cluster, or use Python 2.6 on the
master by setting PYSPARK_PYTHON=python2.6.
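For example, assuming python2.6 is on the PATH of the master node:

export PYSPARK_PYTHON=python2.6
./bin/pyspark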
On Thu, Sep 25, 2014 at 5:11 PM, Andy Davidson
a...@santacruzintegration.com wrote:
Hi
I am running into trouble using iPython
Hi All,
I have a Java program which submits a Spark job to a standalone Spark
cluster (2 nodes; 10 cores (6+4); 12GB (8+4)). This is being called by
another Java program through an ExecutorService, which invokes it multiple
times with different sets of arguments and parameters. I have set the Spark memory
Hello all,
I have some questions regarding the foreachRDD output function in Spark
Streaming.
The programming guide (
http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html)
describes how to output data using a network connection on the worker nodes.
Are there any working examples
The problem is solved: the web interfaces do not open on the local network
when connecting to a server through a proxy; they open only on the servers
without a proxy.
On Thu, Sep 25, 2014 at 1:12 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
Hi Raghuveer,
This might be a better question for the cdh-user
I built a patched DFSClient jar and am now testing it (takes 3 hours...).
I'd like to know whether I can patch Spark builds. How about just replacing
DFSClient.class in the spark-assembly jar?
Jianshi
On Fri, Sep 26, 2014 at 2:29 AM, Ted Yu yuzhih...@gmail.com wrote:
I followed linked JIRAs to HDFS-7005 which