Cached RDDs do not survive SparkContext deletion (they are scoped on a
per-SparkContext basis).
I am not sure what you mean by disk-based cache eviction; if you cache more
RDDs than you have disk space, the result will not be very pretty :)
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
I think she is checking for blanks?
But if the RDD is blank then nothing will happen, no db connections etc.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Sep 8, 2014 at 1:32 PM, Tobias Pfeiffer t...@preferred.jp
Thank you very much for your kind help. I have some additional questions:
- If the RDD is stored in serialized format, does that mean that whenever
the RDD is processed, it will be deserialized and serialized again, to and from
the JVM, even when both are located on the same machine?
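For reference, whether an RDD is kept as serialized bytes is controlled by its
storage level. A minimal spark-shell sketch, assuming the standard persist API
(the sample data is only illustrative):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
// Partitions are kept as serialized byte arrays; each access deserializes
// them, trading CPU time for a smaller memory footprint.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()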
I somehow missed that parameter when I was reviewing the documentation,
that should do the trick! Thank you!
2014-09-10 2:10 GMT+01:00 Shao, Saisai saisai.s...@intel.com:
Hi Luis,
The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can be
used to remove useless timed-out streaming data.
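A minimal sketch of setting both parameters through SparkConf before the
StreamingContext is created (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.cleaner.ttl", "3600")          // seconds; metadata older than this is cleaned
  .set("spark.streaming.unpersist", "true")  // let Spark Streaming unpersist old RDDs automatically
val ssc = new StreamingContext(conf, Seconds(10))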
Hi,
I am using Spark 1.0.0. The bug is fixed in 1.0.1.
Hao
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13864.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi:
I have a question about Spark SQL.
First, I use a left join on two tables, like this:
sql("SELECT * FROM youhao_data LEFT JOIN youhao_age ON
(youhao_data.rowkey = youhao_age.rowkey)").collect().foreach(println)
The result is as I expect.
But when I use a left join on three tables or more, like this:
Great. And you should ask questions on the user@spark.apache.org mailing list. I
believe many people don't subscribe to the incubator mailing list now.
--
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Wednesday, September 10, 2014 at 6:03 PM, redocpot wrote:
Hi,
I am using
Hello,
I am writing a Spark App which is already working so far.
Now I have also started to build some unit tests, but I am running into some
dependency problems and I cannot find a solution right now. Perhaps
someone could help me.
I build my Spark project with SBT and it seems to be configured
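Not from this thread, but a minimal build.sbt sketch of how Spark and a test
framework are often wired together (the versions and the choice of ScalaTest
are assumptions):

name := "spark-app"
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.2",
  "org.scalatest"    %% "scalatest"  % "2.2.1" % "test"
)

// Spark tests tend to misbehave when run in parallel, so running them
// sequentially in a forked JVM is a common precaution.
parallelExecution in Test := false
fork in Test := true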
Hi,
What SBT command and parameters are you using?
Arthur
On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:
Hello,
I am writing a Spark App which is already working so far.
Now I have also started to build some unit tests, but I am running into some
dependency problems and
Hi,
I just called:
test
or
run
Thorsten
On 10.09.2014 at 13:38, arthur.hk.c...@gmail.com wrote:
Hi,
What SBT command and parameters are you using?
Arthur
On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler sp...@tbonline.de wrote:
Hello,
I am writing a Spark App which is already working
Ah, thank you. I did not notice that.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/groupBy-gives-non-deterministic-results-tp13698p13871.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I have a question about Spark. Running this program in the spark-shell:
val filerdd = sc.textFile("NOTICE", 2)
val maprdd = filerdd.map(word => filerdd.map(word2 => word2 + word))
maprdd.collect()
throws a NullPointerException.
Can somebody explain why I cannot have a nested RDD operation?
--pavlos
Is it possible to generate a JavaPairRDD<String, Integer> from a
JavaPairRDD<String, String>, where I can also use the key values? I have
looked at, for instance, mapToPair, but this generates a new K/V pair based on
the original value, and does not give me information about the key.
I need this in the
Hi,
I too had tried SQL queries with joins, MINUS, subqueries etc., but they did
not work in Spark SQL.
I did not find any documentation on what queries work and what do not work
in Spark SQL; maybe we have to wait for the Spark book to be released in
Feb-2015.
I believe you can try HiveQL in
You can't use an RDD inside an operation on an RDD. Here you have
filerdd in your map function. It sort of looks like you want a
cartesian product of the RDD with itself, so look at the cartesian()
method. It may not be a good idea to compute such a thing.
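A minimal sketch of the cartesian() alternative, assuming a spark-shell
session (note the result has N^2 elements, so this gets expensive quickly):

val filerdd = sc.textFile("NOTICE", 2)
// Pair every line with every other line, then concatenate each pair,
// which is roughly what the nested map above was trying to do.
val pairs = filerdd.cartesian(filerdd).map { case (word, word2) => word2 + word }
pairs.collect()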
On Wed, Sep 10, 2014 at 1:57 PM, Pavlos
| Do the two mailing lists share messages ?
I don't think so. I didn't receive this message from the user list. I am not
at Databricks, so I can't answer your other questions. Maybe Davies Liu
dav...@databricks.com can answer you?
--
Ye Xianjin
Sent with Sparrow
So, each key-value pair gets a new value for the original key? You
want mapValues().
On Wed, Sep 10, 2014 at 2:01 PM, Tom thubregt...@gmail.com wrote:
Is it possible to generate a JavaPairRDD<String, Integer> from a
JavaPairRDD<String, String>, where I can also use the key values? I have
looked at
Hi Friends,
I'm using Spark Streaming as a Kafka consumer. I want to do CEP with
Spark, so I need to store my sequence of events so that I can
detect some patterns.
My question is: how can I save my events in a Java collection temporarily, so
that I can detect patterns by
Spark friends,
I recently wrote up a blog post with examples of some of the standard
techniques for improving Spark application performance:
http://chapeau.freevariable.com/2014/09/improving-spark-application-performance.html
The idea is that we start with readable but poorly-performing
Hi,
Maybe you can take a look at the following.
http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html
Good luck.
Arthur
On 10 Sep, 2014, at 9:09 pm, arunshell87 shell.a...@gmail.com wrote:
Hi,
I too had tried SQL queries with joins, MINUS ,
You mention that you are creating a UserGroupInformation inside your
function, but something is still serializing it. You should show your
code since it may not be doing what you think.
If you instantiate an object, it happens every time your function is
called. map() is called once per data
Hi all,
What is your preferred Scala NLP lib? Why?
Are there any items on Spark's roadmap to integrate NLP features?
I basically need to perform NER line by line, so I don't need deep
integration with the distributed engine.
I only want simple dependencies and the chance to build a
Thanks Sean.
Please find attached my code. Let me know your suggestions/ideas.
Regards,
*Sarath*
On Wed, Sep 10, 2014 at 8:05 PM, Sean Owen so...@cloudera.com wrote:
You mention that you are creating a UserGroupInformation inside your
function, but something is still serializing it. You
Hi,
Thanks, it worked, really appreciate your help. I have also been trying to do
multiple lateral views, but it doesn't seem to be working.
Query:
hiveContext.sql("SELECT t2 FROM fav LATERAL VIEW explode(TABS) tabs1 AS t1
LATERAL VIEW explode(t1) tabs2 AS t2")
Exception
Hi,
I am trying to evaluate different options for Spark + Cassandra, and I have a couple
of additional questions.
My aim is to use Cassandra only, without Hadoop:
1) Is it possible to use only Cassandra as the input/output for
PySpark?
2) In case I use Spark (Java, Scala), is it possible to
Hi,
I am having difficulty getting the Cassandra connector running within the spark
shell.
My jars look like:
[wwilkins@phel-spark-001 logs]$ ls -altr /opt/connector/
total 14588
drwxr-xr-x. 5 root root    4096 Sep  9 22:15 ..
-rw-r--r-- 1 root root 242809 Sep 9 22:20
Are you using Spark 1.1? If yes, you would have to update the DataStax
Cassandra connector code and remove references to the log methods from
CassandraConnector.scala
Regards,
Gaurav
Hi,
The latest changes for Kafka message re-play by manipulating the ZK offset
seem to be working fine for us. This gives us some relief until the actual
issue is fixed in Spark 1.2.
I have some questions about how Spark processes the received data. The logic I
used is basically to pull messages from
I think the mails to spark.incubator.apache.org will be forwarded to
spark.apache.org.
Here is the header of the first mail:
from: redocpot julien19890...@gmail.com
to: u...@spark.incubator.apache.org
date: Mon, Sep 8, 2014 at 7:29 AM
subject: groupBy gives non deterministic results
mailing
Yes, your understanding is correct. In that case, the easiest option would be
to serialize the object and dump it somewhere in HDFS so that you will be
able to recreate/update the object from the file.
We have something similar, which you can find in BroadCastServer
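A minimal sketch of that approach, assuming the object is Serializable and
that an HDFS path such as hdfs:///tmp/model-state.bin is writable (both the
path and the ModelState class are placeholders):

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

case class ModelState(version: Int, weights: Array[Double])

val fs   = FileSystem.get(new Configuration())
val path = new Path("hdfs:///tmp/model-state.bin")

// Dump the object to HDFS...
val out = new ObjectOutputStream(fs.create(path, true))
out.writeObject(ModelState(1, Array(0.1, 0.2)))
out.close()

// ...and recreate (or update) it later, possibly from a different job.
val in = new ObjectInputStream(fs.open(path))
val restored = in.readObject().asInstanceOf[ModelState]
in.close()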
My bad, please ignore, it works !!!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/flattening-a-list-in-spark-sql-tp13300p13901.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
What version of Spark SQL are you running here? I think a lot of your
concerns have likely been addressed in more recent versions of the code /
documentation. (Spark 1.1 should be published in the next few days)
In particular, for serious applications you should use a HiveContext and
HiveQL as
HiveQL is the default language for the JDBC server which will be available
as part of the 1.1 release (coming very soon!). Adding support for calling
MLlib and other Spark libraries is on the roadmap, but not possible at this
moment.
On Tue, Sep 9, 2014 at 1:45 PM, XUE, Xiaohui
Thank you!
I am running Spark 1.0, but your suggestion worked for me.
I rem'ed out all the //logDebug lines in both
CassandraConnector.scala and Schema.scala,
and I am moving again.
Regards,
Wade
-Original Message-
From: gtinside [mailto:gtins...@gmail.com]
Sent: Wednesday, September 10, 2014 8:49
Just want to provide more information on how I ran the examples.
Environment: Cloudera quickstart VM 5.1.0 (HBase 0.98.1 installed). I
created a table called 'data1' and 'put' two records in it. I can see the
table and data are fine in the HBase shell.
I cloned the Spark repo and checked out 1.1
Hi Ravi,
2014-09-10 15:55 GMT+02:00 Ravi Sharma raviprincesha...@gmail.com:
I'm using Spark Streaming as a Kafka consumer. I want to do CEP with
Spark, so I need to store my sequence of events so that I can
detect some patterns.
Depending on what you're trying to accomplish,
Well, that's weird. I don't see this thread in my mailbox as being sent to the user
list. Maybe that's because I also subscribe to the incubator mailing list? I do see mails
sent to the incubator mailing list that no one replies to. I thought it was because
people don't subscribe to the incubator list now.
--
Ye Xianjin
Sent
You're using hadoopConf, a Configuration object, in your closure.
That type is not serializable.
You can use -Dsun.io.serialization.extendedDebugInfo=true to debug
serialization issues.
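One common workaround, sketched here under the assumption that the
Configuration is only needed on the executors: build it inside the task
instead of closing over the driver-side instance (the config entries and the
mapping logic are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext

def tagWithFs(sc: SparkContext, paths: Seq[String]): Array[String] = {
  // Only plain, serializable values are captured by the closure.
  val confEntries = Map("fs.defaultFS" -> "hdfs://namenode:8020")
  sc.parallelize(paths).mapPartitions { iter =>
    // The non-serializable Configuration is constructed on the executor.
    val hadoopConf = new Configuration()
    confEntries.foreach { case (k, v) => hadoopConf.set(k, v) }
    iter.map(p => p + " -> " + hadoopConf.get("fs.defaultFS"))
  }.collect()
}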
On Wed, Sep 10, 2014 at 8:23 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Thanks Sean.
How are you creating your Kafka streams in Spark?
If you have 10 partitions for a topic, you can call createStream ten
times to create 10 parallel receivers/executors and then use union to
combine all the DStreams.
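A minimal sketch of that pattern, assuming the receiver-based KafkaUtils API
from spark-streaming-kafka (hosts, group and topic names are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-union"), Seconds(10))

// One receiver per createStream call: ten receivers for a ten-partition topic.
val streams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
}
val unified = ssc.union(streams)
unified.count().print()

ssc.start()
ssc.awaitTermination()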
On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote:
Hi (my
I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case,
the receivers die within an hour because YARN kills the containers for high
memory usage. I set spark.cleaner.ttl to 30 seconds but that didn't help. So I
don't think stale RDDs are an issue here. I did a jmap -histo on a couple
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Slide 39 covers it.
On Tue, Sep 9, 2014 at 9:23 PM, qihong qc...@pivotal.io wrote:
Hi Mayur,
Thanks for your response. I did write a simple test that set up a DStream
with
5 batches; The
Tim, I asked a similar question twice:
here
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-executors-to-stay-alive-tt12940.html
and here
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Executor-OOM-tt12383.html
and have not yet received any responses. I
I had a similar issue and many others - all were basically symptoms of
YARN killing the container for high memory usage. Haven't gotten to the root
cause yet.
On Tue, Sep 9, 2014 at 3:18 PM, Marcelo Vanzin van...@cloudera.com wrote:
Your executor is exiting or crashing unexpectedly:
On Tue, Sep
Hi,
Regarding SparkContext.newAPIHadoopRDD(): would Spark support an InputFormat
that uses Hadoop's distributed cache?
Thanks,
Máximo
I use
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
to help me with testing.
In Spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I
switched to the latest (1.0.1), all tests fail. My SBT import is:
Thanks for the clarification, Yadid. By Hadoop jobs, I meant Spark jobs
that use Hadoop InputFormats (as shown in the cassandra_inputformat.py
example).
A future possibility for accessing Cassandra from PySpark is when Spark SQL
supports Cassandra as a data source.
On Wed, Sep 10, 2014 at 11:37
I used the HiveContext to register the tables, and the tables are still not
being found by the Thrift server. Do I have to pass the HiveContext to JDBC
somehow?
I've been doing some Googling and haven't found much info on how to
incorporate Spark and Accumulo. Does anyone know of some examples of how to
tie Spark to Accumulo (for both fetching data and dumping results)?
Actually, when you register the table, it is only available within the
SQLContext you are running it in. For Spark 1.1, the method name has been changed to
registerTempTable to better reflect that.
The Thrift server runs as a separate process, meaning that it cannot
see any of the temporary tables you registered in your own context.
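A minimal sketch of the distinction, assuming Spark 1.1's HiveContext API
(schemaRDD stands for any SchemaRDD produced by a HiveContext; table names are
placeholders):

// Visible only inside this HiveContext/SQLContext, so the Thrift server
// (a separate process) cannot see it:
schemaRDD.registerTempTable("events_tmp")

// Persisting through the Hive metastore makes the table visible to other
// processes, including the Thrift/JDBC server:
schemaRDD.saveAsTable("events")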
Hi Denny,
There is a related question, by the way.
I have a program that reads in a stream of RDDs, each of which is to be
loaded into a Hive table as one partition. Currently I do this by first
writing the RDDs to HDFS and then loading them into Hive, which requires
multiple passes of HDFS I/O
Hey guys,
After rebuilding from the master branch this morning, I’ve started to see these
errors that I’ve never gotten before while running connected components. Anyone
seen this before?
14/09/10 20:38:53 INFO collection.ExternalSorter: Thread 87 spilling in-memory
batch of 1020 MB to disk
Just a guess:
updateStateByKey has overloaded variants that take a partitioner as a parameter. Can
that help?
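A minimal sketch of that overload, assuming String keys, Int counts, and a
plain HashPartitioner (the update function and partition count are placeholders):

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

def withState(keyedDStream: DStream[(String, Int)]): DStream[(String, Int)] = {
  val updateFunc = (values: Seq[Int], state: Option[Int]) =>
    Some(values.sum + state.getOrElse(0))
  // The overload that takes a Partitioner also controls how the state RDDs
  // are partitioned.
  keyedDStream.updateStateByKey(updateFunc, new HashPartitioner(16))
}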
-Original Message-
From: qihong [mailto:qc...@pivotal.io]
Sent: Tuesday, September 09, 2014 9:13 PM
To: u...@spark.incubator.apache.org
Subject: Re: how to setup steady state stream
Great, good to know it's not just me...
2014-09-10 11:25 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
Yes please pretty please. This is really annoying.
On Sun, Sep 7, 2014 at 6:31 AM, Ognen Duzlevski ognen.duzlev...@gmail.com
wrote:
I keep getting below reply every time I send a
It's very straightforward to set up a Hadoop RDD to use
AccumuloInputFormat. Something like this will do the trick:
private JavaPairRDD<Key, Value> newAccumuloRDD(JavaSparkContext sc,
    AgileConf agileConf, String appName, Authorizations auths)
    throws IOException, AccumuloSecurityException {
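The method body is cut off above; for comparison, a hedged Scala sketch of the
same pattern (the Accumulo mapreduce configuration calls are from the 1.5/1.6
API as I recall them, and the connection details are placeholders, so adjust
for your Accumulo version):

import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.security.Authorizations
import org.apache.hadoop.mapreduce.Job

val job = Job.getInstance()
AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
AccumuloInputFormat.setZooKeeperInstance(job, "instanceName", "zk1:2181")
AccumuloInputFormat.setInputTableName(job, "myTable")
AccumuloInputFormat.setScanAuthorizations(job, new Authorizations())

// sc is an existing SparkContext; Key/Value come back exactly as the
// InputFormat produces them.
val accumuloRdd = sc.newAPIHadoopRDD(job.getConfiguration,
  classOf[AccumuloInputFormat], classOf[Key], classOf[Value])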
To answer my own question... I didn't realize that I was responsible for
telling Spark how much parallelism I wanted for my job. I figured that
between Spark and YARN they'd figure it out for themselves.
Adding --executor-memory 3G --num-executors 24 to my spark-submit command
took the query time
Thanks for your response! I found that too, and it does the trick! Here's
the refined code:
val inputDStream = ...
val keyedDStream = inputDStream.map(...) // use sensorId as key
val partitionedDStream = keyedDStream.transform(rdd => rdd.partitionBy(new
MyPartitioner(...)))
val stateDStream =
On Mon, Sep 8, 2014 at 11:15 PM, Sean Owen so...@cloudera.com wrote:
This structure is not specific to Hadoop, but in theory works in any
JAR file. You can put JARs in JARs and refer to them with Class-Path
entries in META-INF/MANIFEST.MF.
Funny that you mention that, since someone internally
On Wed, Sep 10, 2014 at 1:05 AM, Boxian Dong box...@indoo.rs wrote:
Thank you very much for your kind help. I have some additional questions:
- If the RDD is stored in serialized format, does that mean that whenever
the RDD is processed, it will be deserialized and serialized again, to and from
Hm, so it is:
http://docs.oracle.com/javase/tutorial/deployment/jar/downman.html
I'm sure I've done this before though and thought it was this mechanism. It
must be something custom.
What's the Hadoop jar structure in question then? Is it something special
like a WAR file? I confess I had never
In modern projects there are a bazillion dependencies. When I use Hadoop I
just put them in a lib directory in the jar. If I have a project that
depends on 50 jars, I need a way to deliver them to Spark. Maybe wordcount
can be written without dependencies, but real projects need to deliver
On Wed, Sep 10, 2014 at 3:44 PM, Sean Owen so...@cloudera.com wrote:
What's the Hadoop jar structure in question then? Is it something special
like a WAR file? I confess I had never heard of this so thought this was
about generic JAR stuff.
What I've been told (and Steve's e-mail alludes to)
Try providing the full path to the file you want to write, and make sure the
directory exists and is writable by the Spark process.
On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra arun.lut...@gmail.com wrote:
I have a spark program that worked in local mode, but throws an error in
yarn-client mode on
Hi,
I have a small graph with about 3.3M vertices and close to 7.5M edges. It's a
pretty innocent graph with a max degree of 8.
Unfortunately, graph.triangleCount is failing on me with the exception below.
I'm running a spark-shell on CDH 5.1 with the following params:
SPARK_DRIVER_MEM=10g
Ok, so I don't think the workers on the data nodes will be able to see my
output directory on the edge node. I don't think stdout will work either,
so I'll write to HDFS via rdd.saveAsTextFile(...)
On Wed, Sep 10, 2014 at 3:51 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
Try providing full
Hi Michael,
I think Arthur.hk.chan isn't here now, so I can show something:
1) My Spark version is 1.0.1.
2) When I use multiple joins, like this:
sql("SELECT * FROM youhao_data LEFT JOIN youhao_age ON
(youhao_data.rowkey = youhao_age.rowkey) LEFT JOIN youhao_totalKiloMeter ON
Hi,
You can use this Kafka Spark Consumer:
https://github.com/dibbhatt/kafka-spark-consumer
It does exactly that: it creates parallel receivers for every Kafka
topic partition. You can look at Consumer.java under the consumer.kafka.client
package to see an example of how to use it.
There is
I switched from YARN to standalone mode and haven't had the OOM issue yet.
However, now I have Akka issues killing the executor:
2014-09-11 02:43:34,543 INFO akka.actor.LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
from
Hi,
I am using the PySpark shell and am trying to create an RDD from a NumPy matrix:
rdd = sc.parallelize(matrix)
I am getting the following error:
JVMDUMP039I Processing dump event systhrow, detail
java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait.
JVMDUMP032I JVM requested Heap dump