Thanks very much, it seems to be working...
On Tue, Sep 23, 2014 at 1:58 AM, myasuka myas...@live.com wrote:
Thus I want to know: why are 2-3 tasks per CPU core recommended?
You want at least 1 task per core so that you fully utilize the cluster's
parallelism.
You want 2-3 tasks per core so that tasks are a bit smaller than they would
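For concreteness, a minimal sketch of how that target could be configured; the cluster size and numbers here are hypothetical, not from the thread:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
// Hypothetical cluster: 3 workers x 8 cores = 24 cores, so aim for roughly 48-72 tasks.
val conf = new SparkConf()
  .setAppName("parallelism-example")
  .set("spark.default.parallelism", "64")
val sc = new SparkContext(conf)
// The partition count can also be passed per shuffle operation:
val counts = sc.textFile("hdfs:///input")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 64)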
Hi,
Please help me with that.
BR,
Theodore Si
Hello folks,
I have a Spark Streaming application built with Maven (as jar) and deployed
with the spark-submit script. The application project has the following
(main) structure:
myApp
  src
    main
      scala
        com.mycompany.package
          MyApp.scala
          DoSomething.scala
          ...
      resources
        aPerlScript.pl
        ...
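As an aside, a minimal sketch (not from the original message) of how a file packaged under src/main/resources can be read back from the application jar at runtime, using the resource name from the layout above:
// Bundled resources live on the classpath, not on the local filesystem,
// so they are read as streams rather than by file path.
val stream = getClass.getResourceAsStream("/aPerlScript.pl")
val scriptText = scala.io.Source.fromInputStream(stream).mkString
stream.close()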
Xiangrui,
Yes, the total number of terms is 43839. I have also tried running it using
different values of parallelism ranging from 1/core to 10/core. I also used
multiple configurations, such as setting spark.storage.memoryFraction and
spark.shuffle.memoryFraction to default values. The point to note
I get the following stacktrace if it is of any help.
14/09/23 15:46:02 INFO scheduler.DAGScheduler: failed: Set()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Missing parents for Stage 7: List()
14/09/23 15:46:02 INFO scheduler.DAGScheduler: Submitting Stage 7 (MapPartitionsRDD[24] at
Hi,
I am using Spark 1.0.0. In my Spark code I am trying to persist an RDD to
disk with rdd.persist(DISK_ONLY), but unfortunately I couldn't find the
location where the RDD was written to disk. I specified
SPARK_LOCAL_DIRS and SPARK_WORKER_DIR as some other location rather than
using the default
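For reference, a minimal sketch of disk-only persistence; the input path is just an illustration:
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile("hdfs:///some/input")
rdd.persist(StorageLevel.DISK_ONLY)  // blocks are written under spark.local.dir on each worker
rdd.count()                          // persistence happens when the RDD is first computed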
Hi TD,
This is actually an important requirement (recovery of shared variables) for
us as we need to spread some referential data across the Spark nodes on
application startup. I just bumped into this issue on Spark version 1.0.1. I
assume the latest one also doesn't include this capability. Are
Both of these kinds of functions are OK, but you need to make your class
extend Serializable.
However, none of these ways of passing a function saves on the data that will be sent.
If you define a function that does not use any member fields of the class or
object, you can define it as a val instead of a method.
For
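A minimal sketch of the two variants being described; the class and field names are made up for illustration:
import org.apache.spark.rdd.RDD
class Searcher(val query: String) extends Serializable {
  // Passing a method captures `this`, so the whole object is serialized with the task.
  def matches(s: String): Boolean = s.contains(query)
  def findViaMethod(rdd: RDD[String]): RDD[String] = rdd.filter(matches)
  // Copying the needed field into a local val means only that value is captured by the closure.
  def findViaLocalVal(rdd: RDD[String]): RDD[String] = {
    val q = query
    rdd.filter(s => s.contains(q))
  }
}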
Hi,
spark.local.dir is the directory used to write map output data and persisted RDD
blocks, but the file paths are hashed, so you cannot directly find the
persisted RDD block files; they will definitely be under those folders on your
worker nodes.
Thanks
Jerry
From: Priya Ch
Is it possible to view the persisted RDD blocks?
If I use YARN, RDD blocks would be persisted to HDFS; then will I be able to
read the HDFS blocks as I could do in Hadoop?
On Tue, Sep 23, 2014 at 5:56 PM, Shao, Saisai [via Apache Spark User List]
ml-node+s1001560n14885...@n3.nabble.com wrote:
I couldn't even see the spark-id folder in the default /tmp directory of
local.dir.
On Tue, Sep 23, 2014 at 6:01 PM, Priya Ch learnings.chitt...@gmail.com
wrote:
Is it possible to view the persisted RDD blocks ?
If I use YARN, RDD blocks would be persisted to hdfs then will i be able
I am trying to submit a simple SparkPi application from a Windows machine
which has Spark 1.0.2 to a Hadoop 2.3.0 cluster running on Linux. The SparkPi
application can be launched and executed successfully when running on the
Linux machine, however, I get the following error when I launch from
This folder will be created when you start your Spark application under your
spark.local.dir, with the name “spark-local-xxx” as prefix. It’s quite strange that
you don’t see this folder; maybe you missed something. Besides, if Spark cannot
create this folder on startup, persisting RDDs to disk will be
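A minimal sketch of pointing spark.local.dir at an explicit location so the spark-local-* folders are easy to find; the path is hypothetical, and cluster managers such as YARN may override this setting:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/data/spark-scratch")  // hypothetical path; use a comma-separated list for multiple disks
val sc = new SparkContext(conf)
// On startup each node creates a spark-local-* directory under this path;
// DISK_ONLY blocks and shuffle files end up inside it, with hashed file names.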
Since upgrading to Spark 1.1 we have been seeing the following error in the
logs:
14/09/23 02:14:42 ERROR executor.Executor: Exception in task 1087.0 in
stage 0.0 (TID 607)
java.io.IOException: unexpected exception type
at
great, thanks
--
Nan Zhu
On Tuesday, September 23, 2014 at 9:58 AM, Sean Owen wrote:
Yes, Matei made a JIRA last week and I just suggested a PR:
https://github.com/apache/spark/pull/2508
On Sep 23, 2014 2:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Yes, Matei made a JIRA last week and I just suggested a PR:
https://github.com/apache/spark/pull/2508
On Sep 23, 2014 2:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
shall we document this in the API doc?
Best,
--
Nan Zhu
On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
shall we document this in the API doc?
Best,
--
Nan Zhu
On Sunday, September 21, 2014 at 12:18 PM, Debasish Das wrote:
zipWithUniqueId is also affected...
I had to persist the dictionaries to make use of the indices lower down in
the flow...
On Sun, Sep 21, 2014 at 1:15 AM, Sean
Thanks for looking into it. I'm trying to avoid making the user pass in any
parameters by configuring it to use the right values for the cluster size by
default, hence my reliance on the configuration. I'd rather just use
spark-defaults.conf than the environment variables, and looking at the
Hi,
What is the best way to use Spark version 1.1.0 on EC2?
Regards,
Giba
I know the recommendation is "it depends", but can people share what sort of
memory allocations they're using for their driver processes? I'd like to get
an idea of what the range looks like so we can provide sensible defaults
without necessarily knowing what the jobs will look like. The
Hi,
I'm looking for available online ML algorithms (that improve model with new
streaming data). The only one I found is linear regression.
Is there anything else implemented as part of MLlib?
Thanks, Oleksiy.
Hi all,
I have a Java object that contains an ML model which I would like to use for
prediction (in Python). I just want to iterate the data through a mapper and
predict for each value. Unfortunately, this fails when it tries to serialise
the object to send it to the nodes.
Is there a trick
Thank you for the answer and sorry for the double question, but now it works!
I have one additional question: is it possible to use a broadcast variable
in this object? At the moment I try it in the way below, but the broadcast
object is still null.
object lookupObject {
  private var treeFile :
Hi Andrew,
Thanks for the prompt response. I tried the command line and it works fine. But I
want to try from the IDE for easier debugging and transparency into code execution.
I would try and see if there is any way to get the jar over to the executor
from within the IDE.
- Barrington
On Sep 21,
Hey,
Spark 1.1.0
Kafka 0.8.1.1
Hadoop (YARN/HDFS) 2.5.1
I have a five partition Kafka topic. I can create a single Kafka receiver via
KafkaUtils.createStream with five threads in the topic map and consume messages
fine. Sifting through the user list and Google, I see that it's possible to
Hi,
I am trying to run TPC-H queries with SparkSQL 1.1.0 CLI with 1 r3.4xlarge
master + 20 r3.4xlarge slave machines on EC2 (each machine has 16vCPUs,
122GB memory). The TPC-H scale factor I am using is 1000 (i.e. 1000GB of
total data).
When I try to run TPC-H query 5, the query hangs for a
I thought I did a good job ;-)
OK, so what is the best way to initialize the updateStateByKey operation? I have
counts from a previous spark-submit, and want to load them in the next spark-submit
job.
- Original Message -
From: Soumitra Kumar kumar.soumi...@gmail.com
To: spark users
You should create a pure Python object (copy the attributes from the Java object);
it can then be used in map.
Davies
On Tue, Sep 23, 2014 at 8:48 AM, jamborta jambo...@gmail.com wrote:
Hi all,
I have a java object that contains a ML model which I would like to use for
prediction (in python).
Hi,
I'm trying to run the hbase_inputformat.py example, but it isn't working.
This is the error:
Traceback (most recent call last):
  File "/root/spark/examples/src/main/python/hbase_inputformat.py", line 70, in <module>
    conf=conf)
  File "/root/spark/python/pyspark/context.py", line 471, in
Hi Davies,
Thanks for the reply. I saw that you do it that way in the code. Is
there no other way?
I have implemented all the predict functions in Scala, so I would prefer not
to reimplement the whole thing in Python.
thanks,
On Tue, Sep 23, 2014 at 5:40 PM, Davies Liu dav...@databricks.com
Sure, in local mode it works for me as well; the issue is that I was running the
master only, and I needed a worker as well.
Many thanks,
Moshe Beeri.
054-3133943
Email moshe.be...@gmail.com | linkedin http://www.linkedin.com/in/mobee
On Mon, Sep 22, 2014 at 9:58 AM, Akhil Das-2 [via Apache Spark User List]
Right now, there is no way to access the JVM from a Python worker. In order
to make this happen, we would need to:
1. set up py4j in the Python worker
2. serialize the JVM objects and transfer them to the executors
3. link the JVM objects and py4j together to get an interface
Before that happens, maybe you could try to
Can you show me the DDL you are using? Here is an example of a way I got
the avro serde to work:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246
Also, this isn't ready for primetime yet, but a quick plug for some ongoing
work:
You can't directly query JSON tables from the CLI or JDBC server since
temporary tables only live for the life of the Spark Context. This PR will
eventually (targeted for 1.2) let you do what you want in pure SQL:
https://github.com/apache/spark/pull/2475
On Mon, Sep 22, 2014 at 4:52 PM, Yin
A workaround for now would be to save the JSON as Parquet and then create a
metastore parquet table. Using parquet will be much faster for repeated
querying. This function might be helpful:
import org.apache.spark.sql.hive.HiveMetastoreTypes
def createParquetTable(name: String, file: String,
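A minimal sketch of the first half of that workaround (saving the JSON as Parquet) with the Spark 1.1 API; the paths are hypothetical:
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
// Read the JSON once and write it out as Parquet (hypothetical paths).
val events = hiveContext.jsonFile("hdfs:///data/events.json")
events.saveAsParquetFile("hdfs:///warehouse/events_parquet")
// Repeated queries can then run against the Parquet copy, e.g. loaded via
// hiveContext.parquetFile("hdfs:///warehouse/events_parquet") and registered as a table.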
I don't know if it's relevant, but I had to compile Spark for my specific
HBase and Hadoop version to make that hbase_inputformat.py work.
I am running the job on 500 executors, each with 8G and 1 core.
I see lots of fetch failures in the reduce stage when running a simple
reduceByKey:
map tasks - 4000
reduce tasks - 200
On Mon, Sep 22, 2014 at 12:22 PM, Chen Song chen.song...@gmail.com wrote:
I am using Spark 1.1.0 and have seen a
I'm running my application on a three-node cluster (8 cores and 12 GB of memory
each), and I receive the following actor error; does anyone have any idea?
14:31:18,061 ERROR [akka.remote.EndpointWriter]
(spark-akka.actor.default-dispatcher-17) Transient association error
(association remains live):
Hi Oleksiy,
Right now, only streaming linear regression is available in MLlib. There
is work in progress on Streaming K-means and Streaming SVM. Please take
a look at the following JIRAs for more information.
Streaming K-means https://issues.apache.org/jira/browse/SPARK-3254
Streaming SVM
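For reference, a minimal sketch of the streaming linear regression API that is already in 1.1; the directories, batch interval and feature count are hypothetical:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
// Text files dropped into these directories are parsed as "(label,[f1,f2,f3])" lines.
val trainingStream = ssc.textFileStream("hdfs:///train").map(LabeledPoint.parse)
val testStream = ssc.textFileStream("hdfs:///test").map(LabeledPoint.parse)
val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(3))
model.trainOn(trainingStream)                        // the model is updated on every batch
model.predictOn(testStream.map(_.features)).print()  // predictions are printed per batch
ssc.start()
ssc.awaitTermination()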
I solved it :) I moved the lookupObject into the function where I create the
broadcast and now all works very well!
object lookupObject {
  private var treeFile: org.apache.spark.broadcast.Broadcast[String] = _
  def main(args: Array[String]): Unit = {
    …
    val treeFile = sc.broadcast(args(0))
At a high level, the suggestion sounds good to me. However, regarding the code,
it's best to submit a pull request on the Spark GitHub page for community
review. You will find more information here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
On Tue, Sep 23, 2014 at 10:11 PM,
Or maybe there is a bug related to the base64 encoding in py4j; could you
dump the serialized bytes of the closure to verify this?
You could add a line in spark/python/pyspark/rdd.py:
ser = CloudPickleSerializer()
pickled_command = ser.dumps(command)
+ print len(pickled_command),
sowen wrote
it seems that the singular values from the SVD aren't returned, so I don't
know that you can access this directly
It's not clear to me why these aren't returned. The S matrix would be useful
to determine a reasonable value for K.
Great. Thanks a lot.
On 23 Sep 2014 18:44, Davies Liu-2 [via Apache Spark User List]
ml-node+s1001560n14908...@n3.nabble.com wrote:
Right now, there is no way to access JVM in Python worker, in order
to make this happen, we need to do:
1. setup py4j in Python worker
2. serialize the JVM
In its current implementation, the principal components are computed in
MLlib in two steps:
1) In a distributed fashion, compute the covariance matrix - the result is
a local matrix.
2) On this local matrix, compute the SVD.
The sorting comes from the SVD. If you want to get the eigenvalues out,
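As a hedged illustration of getting the singular values out today: computeSVD on the same RowMatrix returns them directly (the data below is made up):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)))
val mat = new RowMatrix(rows)
val k = 2
val pcs = mat.computePrincipalComponents(k)   // principal components (no singular values exposed)
val svd = mat.computeSVD(k, computeU = false)
println(svd.s)                                // the singular values, useful for choosing k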
Thanks very much for the pointer, which validated my initial approach. It
turns out that I was creating a tag for the abstract class
InputFormat.class. Using TextInputFormat.class instead fixed my issue.
Regards,
Mike
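For anyone hitting the same thing, a minimal Scala sketch where a concrete class such as TextInputFormat supplies the ClassTag (the path is hypothetical):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
// The third type parameter must be a concrete InputFormat subclass;
// using the abstract InputFormat here reproduces the problem described above.
val lines = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///some/path")
  .map { case (_, text) => text.toString }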
I am trying to load data from csv format into parquet using Spark SQL.
It consistently runs out of memory.
The environment is:
* standalone cluster using HDFS and Hive metastore from HDP2.0
* spark1.1.0
* parquet jar files (v1.5) explicitly added when starting spark-sql.
I would hope that things should work for this kind of workflow.
I'm curious if you have tried using saveAsParquetFile instead of inserting
directly into a hive table (you could still register this as an external
table afterwards). Right now inserting into Hive tables is going to go
through their
Posting your code would be really helpful in figuring out gotchas.
On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote:
Hey,
Spark 1.1.0
Kafka 0.8.1.1
Hadoop (YARN/HDFS) 2.5.1
I have a five partition Kafka topic. I can create a single Kafka receiver
via
This is actually very tricky, as there are two pretty big challenges that need
to be solved.
(i) Checkpointing for broadcast variables: unlike RDDs, broadcast variables
don't have checkpointing support (that is, you cannot write the content of a
broadcast variable to HDFS and recover it automatically
I have only been using Spark through the SQL front-end (CLI or JDBC). I don't
think I have access to saveAsParquetFile from there, do I?
Hi,
I tried out the HdfsWordCount program in the Streaming module on a cluster.
Based on the output, I find that it counts only a few of the words. How can
I have it count all the words in the text? I have only one text file in the
directory.
thanks
I have a general question on when persisting will be beneficial and when it
won't:
I have a task that runs as follows:
keyedRecordPieces = records.flatMap(record => Seq(key, recordPieces))
partitoned = keyedRecordPieces.partitionBy(KeyPartitioner)
partitoned.mapPartitions(doComputation).save()
I am not sure what you mean by "data checkpoints continuously increase,
leading to the recovery process taking time". Do you mean that in HDFS you are
seeing RDD checkpoint files being continuously written but never being
deleted?
On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB rodrigo.boav...@aspect.com
Hi Arun,
The intermediate results like keyedRecordPieces will not be materialized.
This means that if you run
partitoned = keyedRecordPieces.partitionBy(KeyPartitioner)
partitoned.mapPartitions(doComputation).save()
again, keyedRecordPieces will be recomputed. In this case, cache or
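A self-contained sketch of the caching being suggested, with illustrative stand-ins for records, KeyPartitioner and doComputation (none of these details come from the thread):
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel
val records = sc.parallelize(Seq("a:1", "b:2", "a:3"))
val keyedRecordPieces = records.map { r => val Array(k, v) = r.split(":"); (k, v) }
val partitoned = keyedRecordPieces
  .partitionBy(new HashPartitioner(4))
  .persist(StorageLevel.MEMORY_AND_DISK)  // the addition: keep the shuffled, partitioned RDD around
partitoned.mapPartitions(_.map { case (k, v) => s"$k=$v" }).count()  // first action computes and caches
partitoned.count()                                                    // later actions reuse the cached partitions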
Hi,
I'm trying to read some data in RCFiles using Spark, but can't seem to find
a suitable example anywhere. Currently I've written the following bit of
code that lets me count() the no. of records, but when I try to do a
collect() or a map(), it fails with a ConcurrentModificationException. I'm
Hi TD, thanks for getting back on this.
Yes, that's what I was experiencing - data checkpoints were being recovered
from a considerable time before the last data checkpoint, probably since the
beginning of the first writes; I would have to confirm. I have some
developments on this though.
These results
Hello,
So I have created a table in an RDD in Spark in this format:
     col1  col2
     ----  ----
1.    10    11
2.    12     8
3.     9    13
4.     2     3
And the RDD is distributed by rows (rows 1 and 2 on one node and rows 3 and 4 on
another).
I want to sort each column of the table so
Thanks Liquan, that makes sense, but if I am only doing the computation
once, there will essentially be no difference, correct?
I had a second question related to mapPartitions:
1) All of the records of the Iterator[T] that a single function call in
mapPartitions processes must fit into memory,
Just a follow-up.
Just to make sure about the RDDs not being cleaned up, I just replayed the
app both on the Windows remote laptop and then on the Linux machine and at
the same time was observing the RDD folders in HDFS.
Confirming the observed behavior: running on the laptop I could see the
Yes... good find. I have filed a JIRA here:
https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it
shortly. Both of these fixes will be available in 1.1.1. Until both of
these are merged in, it appears that the only way you can do it now is
through --driver-memory.
-Andrew
I have an SBT Spark project compiling fine in IntelliJ.
However, when I try to create a SparkContext from a worksheet:
import org.apache.spark.SparkContext
val sc1 = new SparkContext("local[8]", "sc1")
I get this error:
com.typesafe.config.ConfigException$Missing: No configuration setting found
for
You could pluck out each column in separate rdds, sort them independently,
and zip them :)
On Tue, Sep 23, 2014 at 2:40 PM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
abaghdasa...@bloomberg.net wrote:
Hello,
So I have crated a table in in RDD in spark in thei format:
col1 col2
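A minimal sketch of that idea for the two-column table above; zipWithIndex plus a join is used instead of a bare zip, since independently sorted RDDs are not guaranteed to end up with identical partition sizes:
import org.apache.spark.SparkContext._
val table = sc.parallelize(Seq((10, 11), (12, 8), (9, 13), (2, 3)))
// Pluck out each column, sort it, and index the sorted values.
val col1 = table.map(_._1).sortBy(identity).zipWithIndex().map(_.swap)
val col2 = table.map(_._2).sortBy(identity).zipWithIndex().map(_.swap)
// Joining on the index lines the independently sorted columns back up row by row.
val sortedColumns = col1.join(col2).sortByKey().values
sortedColumns.collect().foreach(println)  // (2,3), (9,8), (10,11), (12,13)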
I am trying to get an edge server up and running connecting to our Spark 1.1
cluster. The edge server is in a different DMZ than the rest of the cluster
and we have to specifically open firewall ports between the edge server and the
rest of the cluster. I can log on to any node in the
Sorry, I am almost Java illiterate but here's my Scala code to do the
equivalent (that I have tested to work):
val kInStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost.acme.net:2182", "myGrp",
    Map("myTopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
} // Create 10 receivers across the cluster,
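As a possible next step for the five-partition topic discussed above, a hedged sketch of unioning those receivers into one DStream, reusing the names from the snippet:
// Union the per-receiver streams so downstream transformations see a single logical stream.
val unified = ssc.union(kInStreams)
unified.map(_._2)  // Kafka messages arrive as (key, message) pairs
  .count()
  .print()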
To my eyes, these are functionally equivalent. I’ll try a Scala approach, but
this may cause waves for me upstream (e.g., non-Java)
Thanks for looking at this. If anyone else can see a glaring issue in the Java
approach that would be appreciated.
Thanks,
Matt
On Sep 23, 2014, at 4:13 PM,
Hi,
After executing sql() in SQLContext or HiveContext, is there a way to tell
whether the query/command succeeded or failed? Method sql() returns SchemaRDD
which either is empty or contains some Rows of results. However, some queries
and commands do not return results by nature; being empty
When I experimented with using an InputFormat I had used in Hadoop for a
long time, I found:
1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
2) initialize needs to be called in the constructor
3)
I'm trying out some streaming with spark and I'm getting an error that
puzzles me since I'm new to Spark. I get this error all the time; sometimes 1-2
batches in the stream are processed before the job stops, but never the
complete job, and often no batch is processed at all. I use Spark 1.1.0.
The job
Is your file managed by Hive (and thus present in a Hive metastore)? In that
case, Spark SQL
(https://spark.apache.org/docs/latest/sql-programming-guide.html) is the
easiest way.
Matei
On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com)
wrote:
Hi,
I'm trying to
Your dataset is small. NaiveBayes should work under the default
settings, even in local mode. Could you try local mode first without
changing any Spark settings? Since your dataset is small, could you
save the vectorized data (RDD[LabeledPoint]) and send me a sample? I
want to take a look at the
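A minimal sketch of saving a sample of the vectorized RDD[LabeledPoint] in a portable format, as requested above; the path and sampling fraction are hypothetical:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
def saveSample(vectorized: RDD[LabeledPoint]): Unit = {
  // `vectorized` stands in for the RDD[LabeledPoint] built from the newsgroup data.
  val sample = vectorized.sample(withReplacement = false, fraction = 0.1, seed = 42L)
  MLUtils.saveAsLibSVMFile(sample, "hdfs:///tmp/newsgroup-sample-libsvm")
}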
Thanks,
I didn't create the tables myself as I have no control over that process.
However, these tables are read just fine using the JDBC connection to
HiveServer2, so it should be possible.
On Sep 24, 2014 12:48 AM, Michael Armbrust mich...@databricks.com wrote:
Can you show me the DDL you are
Hi Paul,
There are several ports you need to configure in order to run in a tight
network environment. It sounds like the DMZ that contains the Spark
cluster is wide open internally, but you have to poke holes between it
and the driver.
You should take a look at the port configuration
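As one illustration, a couple of the ports can be pinned so the firewall rules between the DMZs can target fixed values (the port numbers here are hypothetical):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("edge-server-app")
  .set("spark.driver.port", "51000")  // fixed port for executors connecting back to the driver
  .set("spark.ui.port", "4040")       // web UI port to open toward the edge server
val sc = new SparkContext(conf)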
Hi Theodore,
What do you mean by module diagram? A high level architecture diagram of
how the classes are organized into packages?
Andrew
On Tue, Sep 23, 2014 at 12:46 AM, Theodore Si sjyz...@gmail.com wrote:
Hi,
Please help me with that.
BR,
Theodore Si
I execute it as follows:
$SPARK_HOME/bin/spark-submit --master <master url> --class \
  org.apache.spark.examples.streaming.HdfsWordCount \
  target/scala-2.10/spark_stream_examples-assembly-1.0.jar hdfsdir
After I start the job, I add a new test file in hdfsdir. It is a large text
file which I
Also, you'd rather have 2-3 tasks per core than 1 task per core because if
1 task per core is actually 1.01 tasks per core, then you get one wave of
tasks that completes and then another wave containing very few tasks.
You get better utilization when you're higher than 1.
Aaron Davidson goes
Xiangrui, Thanks for replying.
I am using the subset of newsgroup20 data. I will send you the vectorized
data for analysis shortly.
I have tried running in local mode as well but I get the same OOM exception.
I started with 4GB of data but then moved to a smaller set to verify that
everything
An exception should be thrown in the case of failure for DDL commands.
On Tue, Sep 23, 2014 at 4:55 PM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
After executing sql() in SQLContext or HiveContext, is there a way to
tell whether the query/command succeeded or failed? Method sql()
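A minimal sketch of relying on that: catch the exception around sql() to detect failure (the statement itself is hypothetical):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
try {
  // DDL returns an (empty) SchemaRDD on success and throws on failure.
  hiveContext.sql("CREATE TABLE IF NOT EXISTS demo_tbl (key INT, value STRING)")
  println("command succeeded")
} catch {
  case e: Exception => println("command failed: " + e.getMessage)
}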
Hello,
I am new to Spark. I have downloaded Spark 1.1.0 and am trying to run the
TallSkinnySVD.scala example with different input data sizes. I tried
input data with a 1000x1000 matrix and a 5000x5000 matrix.
Though I faced some Java heap issues, I added the following parameters in
spark-defaults.conf
Hi,
Is it always possible to get one RDD from another?
For example, if I do a *top(K)(Ordering)*, I get an Int, right? (In my
example the element type is Int.) I do not get an RDD back.
Can anyone explain this to me?
Thank You
Hi Andrew,
I can't speak for Theodore, but I would find that incredibly useful.
Dave
On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash and...@andrewash.com wrote:
Hi Theodore,
What do you mean by module diagram? A high level architecture diagram of
how the classes are organized into packages?
This sbt/sbt assembly process began yesterday and has already been running for
more than 20 hours. Is this normal? Has anyone had the same problem? No error has
been thrown yet.
Here is my understanding:
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = {
  if (num == 0) { // if 0, return an empty array
    Array.empty
  } else {
    mapPartitions { items => // map each partition to a new one whose iterator
                             // consists of a single bounded queue,
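And for the original question, a tiny usage sketch showing that top/takeOrdered return a local Array rather than an RDD, which can be re-parallelized if needed:
val rdd = sc.parallelize(Seq(5, 1, 9, 3, 7))
val top3: Array[Int] = rdd.top(3)     // Array(9, 7, 5): a local array, not an RDD
val smallest3 = rdd.takeOrdered(3)    // Array(1, 3, 5)
val backToRdd = sc.parallelize(top3)  // wrap the result in an RDD again if needed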
Definitely something is wrong. For me, it takes 10 to 30 minutes.
Thanks.
Zhan Zhang
On Sep 23, 2014, at 10:02 PM, christy 760948...@qq.com wrote:
This process began yesterday and it has already run for more than 20 hours.
Is it normal? Any one has the same problem? No error throw out yet.
--
Hi,
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-building-project-with-sbt-assembly-is-extremely-slow-td13152.html
-- Maybe related to this?
Tobias
I tested the examples according to the docs in the Spark SQL programming guide,
but a java.lang.StackOverflowError occurred every time I called
sqlContext.sql(...).
Meanwhile, it worked fine with a HiveContext. The Hadoop version is 2.2.0, the
Spark version is 1.1.0, built with YARN and Hive. I would be
If you have enough memory, the build will be faster, within one minute, since
most of the files are cached. You can also build your Spark project on a
mounted ramfs in Linux; this will also speed up the process.
Thanks
Jerry
-Original Message-
From: Zhan Zhang
I installed Ganglia, and I think it works well for Hadoop and HBase, as I can
see Hadoop/HBase metrics on ganglia-web. I want to use Ganglia to monitor
Spark, and I followed these steps: 1) first I did a custom compile
with -Pspark-ganglia-lgpl, and it succeeded without