Hm, is this not just showing that you're rate-limited by how fast you
can get events to the cluster? You have more of a network bottleneck
between the data source and the cluster in the cloud than with your local
cluster.
On Tue, Oct 14, 2014 at 9:44 PM, danilopds danilob...@gmail.com wrote:
Hi,
I'm learning
The problem is not ReduceWords, since it is already Serializable by
implementing Function2. Indeed the error tells you just what is
unserializable: KafkaStreamingWordCount, your driver class.
Something is causing a reference to the containing class to be
serialized in the closure. The best fix is
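The message cuts off, but the pitfall it describes has a local Python analog worth sketching: a function that references the enclosing object forces the whole object into the serialized closure, while copying just the needed field keeps the payload small. All names here (Driver, scale) are hypothetical, and this uses plain pickle rather than Spark's closure serializer.

```python
import pickle
from functools import partial

def scale(multiplier, x):
    # Module-level function: pickled by reference, carries no object state.
    return multiplier * x

class Driver:
    """Hypothetical stand-in for a driver class holding heavy state."""
    def __init__(self):
        self.multiplier = 3
        self.big_state = list(range(100000))  # baggage we must not ship

    def compute(self, x):
        return self.multiplier * x

    def bad_func(self):
        # A bound method drags the whole Driver, big_state included,
        # into its serialized form.
        return self.compute

    def good_func(self):
        # Copy only the field that is needed; the returned callable
        # closes over one small int instead of over self.
        return partial(scale, self.multiplier)

d = Driver()
small = pickle.dumps(d.good_func())
large = pickle.dumps(d.bad_func())
assert len(small) < len(large)   # one int vs. the whole driver
assert pickle.loads(small)(5) == 15
```

The same idiom applies in Scala: assign the field to a local `val` before using it inside the closure, so only that value is captured.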
Hi All,
Could someone shed some light on why adding an element to a MutableList can
result in a type mismatch, even when I'm sure the class type is right?
Below is the sample code I ran in the Spark 1.0.2 console; at the end of the line,
there is a type mismatch error:
Welcome to
Another instance of https://issues.apache.org/jira/browse/SPARK-1199 ,
fixed in subsequent versions.
On Wed, Oct 15, 2014 at 7:40 AM, Henry Hung ythu...@winbond.com wrote:
Hi All,
Could someone shed some light on why adding an element to a MutableList can
result in a type mismatch, even if
Can anyone help me, please?
On 10/14/2014 9:58 PM, Theodore Si wrote:
Hi all,
I have two nodes, one as master(*host1*) and the other as
worker(*host2*). I am using the standalone mode.
After starting the master on host1, I run
$ export MASTER=spark://host1:7077
$ bin/run-example SparkPi 10
on
[Removing dev lists]
You are absolutely correct about that.
Prashant Sharma
On Tue, Oct 14, 2014 at 5:03 PM, Priya Ch learnings.chitt...@gmail.com
wrote:
Hi Spark users/experts,
In Spark source code (Master.scala Worker.scala), when registering the
worker with master, I see the usage
You say you reduceByKey but are you really collecting all the tuples
for a vehicle in a collection, like what groupByKey does already? Yes,
if one vehicle has a huge amount of data that could fail.
Otherwise perhaps you are simply not increasing memory from the default.
Maybe you can consider
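The memory difference being described can be sketched locally in plain Python, with no Spark; `group_by_key` and `reduce_by_key` here are hypothetical stand-ins for the RDD operations:

```python
def group_by_key(pairs):
    # What groupByKey does: every value for a key is held in memory at
    # once, so one very hot key (a vehicle with a huge history) can
    # exhaust the heap.
    grouped = {}
    for k, v in pairs:
        grouped.setdefault(k, []).append(v)
    return grouped

def reduce_by_key(pairs, combine):
    # What reduceByKey does: values are folded into one running result
    # per key, so memory per key stays constant however many records
    # that key has.
    reduced = {}
    for k, v in pairs:
        reduced[k] = combine(reduced[k], v) if k in reduced else v
    return reduced

records = [("vehicle-1", 3), ("vehicle-1", 5), ("vehicle-2", 1)]
assert reduce_by_key(records, lambda a, b: a + b) == {"vehicle-1": 8, "vehicle-2": 1}
assert group_by_key(records) == {"vehicle-1": [3, 5], "vehicle-2": [1]}
```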
How did you recompile and deploy Spark to your cluster? It sounds like
a problem with not getting the assembly deployed correctly, rather
than with your app.
On Tue, Oct 14, 2014 at 10:35 PM, Tamas Sandor tsan...@gmail.com wrote:
Hi,
I'm rookie in spark, but hope someone can help me out. I'm
Examine the output (replace $YARN_APP_ID in the following with the
application identifier output by the previous command) (Note:
YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs
depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_01/stdout
Hi Jimmy,
Did you try my patch?
The problem on my side was that the hadoop.tmp.dir (in hadoop core-site.xml)
was not handled properly by Spark when it is set on multiple partitions/disks,
i.e.:
<property>
<name>hadoop.tmp.dir</name>
Hi,
we are Spark users and we use some Spark's test classes for our own application
unit tests. We use LocalSparkContext and SharedSparkContext. But these classes
are not included in the spark-core library. This is a good option as it's not a
good idea to include test classes in the runtime
Hi,
We really would like to use Spark but we can’t because we have a secure HDFS
environment (Cloudera).
I understood https://issues.apache.org/jira/browse/SPARK-2541 contains a patch.
Can one of the committers please take a look?
Thanks!
Erik.
—
Erik van Oosten
Hi, I'm pretty new to both Big Data and Spark. I've just started POC work on
Spark, and my team and I are evaluating it against other in-memory computing
tools such as GridGain, BigMemory, Aerospike, and some others, specifically
to solve two sets of problems. 1) Data Storage: Our current application
runs
Hi,
How large is the dataset you're saving into S3?
Actually saving to S3 is done in two steps:
1) writing temporary files
2) committing them to the proper directory
Step 2) can be slow because S3 does not have a quick atomic move
operation; you have to copy (server side, but it still takes time) and then
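A minimal local sketch of the two-step save just described, with a real filesystem rename standing in for S3's copy-then-delete (paths and the `save` helper are illustrative, not Spark's actual committer):

```python
import os
import shutil
import tempfile

def save(parts, final_dir):
    # Step 1: write every part file into a temporary attempt directory.
    tmp = tempfile.mkdtemp(prefix="_temporary-")
    for name, data in parts.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(data)
    # Step 2: commit by moving the parts into the final directory. On a
    # local disk or HDFS this rename is cheap; S3 has no rename, so the
    # "move" becomes a server-side copy plus a delete per object, which
    # is why the commit phase can dominate for large outputs.
    os.makedirs(final_dir, exist_ok=True)
    for name in os.listdir(tmp):
        shutil.move(os.path.join(tmp, name), os.path.join(final_dir, name))
    os.rmdir(tmp)

out = os.path.join(tempfile.mkdtemp(), "result")
save({"part-00000": "a,b,c"}, out)
with open(os.path.join(out, "part-00000")) as f:
    assert f.read() == "a,b,c"
```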
I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found
that DataTypeConversions is protected[sql].
Finally I find this solution:
<pre><code>
jrdd.registerTempTable("transform_tmp")
jrdd.sqlContext.sql("select * from transform_tmp")
</code></pre>
Could anyone tell me
which means the details are not persisted, and hence after any failures the
workers and master wouldn't start the daemons normally... right?
On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User
List] ml-node+s1001560n16468...@n3.nabble.com wrote:
[Removing dev lists]
You are
So if you need those features you can go ahead and set up one of the
filesystem or ZooKeeper options. Please take a look at:
http://spark.apache.org/docs/latest/spark-standalone.html.
Prashant Sharma
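The two recovery options mentioned can be sketched as configuration; these are the standard standalone-recovery properties from the page linked above, with hypothetical hostnames and paths:

```shell
# ZooKeeper-based HA, in conf/spark-env.sh on each master:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"

# Or single-master restart recovery backed by a filesystem directory:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
  -Dspark.deploy.recoveryDirectory=/var/spark/recovery"
```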
On Wed, Oct 15, 2014 at 3:25 PM, Chitturi Padma
learnings.chitt...@gmail.com wrote:
which means
I just ran the same code and it is running perfectly fine on my machine.
These are the things on my end:
- Spark version: 1.1.0
- Gave full path to the negative and positive files
- Set twitter auth credentials in the environment.
And here's the code:
import org.apache.spark.SparkContext
What results do you want?
If your pair is like (a, b), where a is the key and b is the value, you
can try
rdd1 = rdd1.flatMap(lambda l: l)
and then use cogroup.
Best
Gen
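The flatten-then-cogroup suggestion can be sketched locally with plain Python lists standing in for RDDs; `cogroup` here is a hypothetical local analog of the RDD method:

```python
def cogroup(left, right):
    # Local analog of RDD.cogroup: for each key, pair up the list of
    # values seen on each side.
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for kk, v in left if kk == k],
                [v for kk, v in right if kk == k])
            for k in keys}

# If rdd1's elements are themselves lists of (key, value) pairs, flatten
# first -- the flatMap(lambda l: l) step:
nested = [[("a", 1), ("b", 2)], [("a", 3)]]
flat = [pair for sublist in nested for pair in sublist]
assert flat == [("a", 1), ("b", 2), ("a", 3)]
assert cogroup(flat, [("a", 9)]) == {"a": ([1, 3], [9]), "b": ([2], [])}
```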
Hi,
I am using 1.1.0. I did set my twitter credentials and I am using the full
path. I did not paste this in the public post. I am running on a cluster
and getting the exception. Are you running in local or standalone mode?
Thanks
On Oct 15, 2014 3:20 AM, Akhil Das ak...@sigmoidanalytics.com
I ran it in both local and standalone mode, and it worked for me. It does throw a
bind exception, which is normal since we are using both a SparkContext and a
StreamingContext.
Thanks
Best Regards
On Wed, Oct 15, 2014 at 5:25 PM, S Krishna skrishna...@gmail.com wrote:
Hi,
I am using 1.1.0. I did set my
How did you resolve it?
On Tue, Jul 15, 2014 at 3:50 AM, SK skrishna...@gmail.com wrote:
The problem is resolved. Thanks.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
The following query in sparkSQL 1.1.0 CLI doesn't work.
SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse;
create table test as
select v1.*, v2.card_type, v2.card_upgrade_time_black,
v2.card_upgrade_time_gold
from customer v1 left join customer_loyalty v2
on v1.account_id =
Hi,
I have a Spark standalone example application which is working fine.
I'm now trying to integrate this application into a J2EE application, deployed
on JBoss 7.1.1 and accessed via a web service. The JBoss server is installed on
my local machine (Windows 7) and the master Spark is remote
In order to share an HBase connection pool, we create an object
object Util {
  val HBaseConf = HBaseConfiguration.create
  val Connection = HConnectionManager.createConnection(HBaseConf)
}
which would be shared among tasks on the same executor. e.g.
val result = rdd.map(line => {
  val table
It is wonderful to see some ideas.
Now the questions:
1) What is a track segment?
Ans) It is the line that connects two adjacent points when all points are
arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3).
Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1
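The segment definition above fits in a few lines of plain Python (`track_segments` is a hypothetical name, not anything from the original code):

```python
def track_segments(points):
    # points: (timestamp, position) pairs in arbitrary order.
    # A segment joins each pair of adjacent points once ordered by time.
    ordered = [p for _, p in sorted(points)]
    return list(zip(ordered, ordered[1:]))

track = [(1, "p1"), (3, "p3"), (2, "p2")]
assert track_segments(track) == [("p1", "p2"), ("p2", "p3")]
```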
+user@hbase
2014-10-15 20:48 GMT+08:00 Fengyun RAO raofeng...@gmail.com:
We use Spark 1.1, and HBase 0.98.1-cdh5.1.0, and need to read and write an
HBase table in Spark program.
I notice there are:
spark.driver.extraClassPath
spark.executor.extraClassPath properties to manage extra
Ok,
I understand.
But in both cases the data are in the same processing node.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/A-question-about-streaming-throughput-tp16416p16501.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Pardon me - there was a typo in my previous email.
Calling table.close() is the recommended approach.
HConnectionManager does reference counting. When all references to the
underlying connection are gone, connection would be released.
Cheers
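The reference counting described can be sketched locally; the names here are hypothetical and this is not the HBase API, just the release-on-last-close behavior:

```python
class SharedConnection:
    """Hypothetical sketch of a reference-counted shared connection."""
    def __init__(self):
        self.refs = 0
        self.closed = False

    def acquire(self):
        self.refs += 1
        return self

    def close(self):
        # Each user calls close(); the underlying resource is released
        # only when the last reference is gone.
        self.refs -= 1
        if self.refs == 0:
            self.closed = True

conn = SharedConnection()
a = conn.acquire()
b = conn.acquire()
a.close()
assert not conn.closed   # one reference still outstanding
b.close()
assert conn.closed       # last close releases the connection
```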
On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu
I am writing to HBase, following are my options:
export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar
spark-submit \
--jars
This is still happening to me on mesos. Any workarounds?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-tp2312p16506.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
It looks like you're making the StreamingContext and SparkContext
separately from the same conf. Instead, how about passing the
SparkContext to the StreamingContext constructor? It seems like better
practice, and is my guess at the cause of the problem.
On Tue, Oct 14, 2014 at 9:13 PM, SK
hi there... are there any other matrix operations in addition to multiply()?
like addition or dot product?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/matrix-operations-tp16508.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
hi.. it looks like RowMatrix.multiply() takes a local Matrix as a parameter
and returns the result as a distributed RowMatrix.
how do you perform this series of multiplications if A, B, C, and D are all
RowMatrix?
((A x B) x C) x D)
thanks!
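The left-associated chain in the question can be made concrete with plain nested lists (no Spark; `matmul` and `chain` are hypothetical local helpers showing only the evaluation order):

```python
from functools import reduce

def matmul(a, b):
    # Plain nested-list matrix product: a is m x n, b is n x p.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def chain(*matrices):
    # Left-associated product: ((A x B) x C) x D.
    return reduce(matmul, matrices)

A = [[1, 2], [3, 4]]
I = [[1, 0], [0, 1]]
assert chain(A, I, I, I) == A
```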
Hi Yin,
pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive
0.12 from existing Avro data using CREATE TABLE followed by an INSERT
OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition
while pqt_segcust_snappy has two partitions. For
I guess I was a little light on the details in my haste. I'm using Spark on
YARN, and this is in the driver process in yarn-client mode (most notably
spark-shell). I've had to manually add a bunch of JARs that I had thought it
would just pick up like everything else does:
export
From this line: "Removing executor app-20141015142644-0125/0 because it is
EXITED", I would guess that you need to examine the executor log to see why
the executor actually exited. My guess would be that the executor cannot
connect back to your driver. But check the log from the executor. It should
Hi,
I am trying to persist the files generated as a result of Naive bayes
training with MLlib. These comprise the model file, the label index (own
class), and the term dictionary (own class). I need to save them on an HDFS
location and then deserialize when needed for prediction.
How can I do the same
Hi,
I compiled spark 1.1.0 with CDH 4.6 but when I try to get spark-sql cli up,
it gives error:
==
[atangri@pit-uat-hdputil1 bin]$ ./spark-sql
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
I see the Hive 0.10.0 metastore SQL does not have a VERSION table, but Spark is
looking for it.
Anyone else faced this issue or any ideas on how to fix it ?
Thanks,
Anurag Tangri
On Wed, Oct 15, 2014 at 10:51 AM, Anurag Tangri atan...@groupon.com wrote:
Hi,
I compiled spark 1.1.0 with CDH 4.6
Hi Anurag,
Spark SQL (from the Spark standard distribution / sources) currently
requires Hive 0.12; as you mention, CDH4 has Hive 0.10, so that's not
gonna work.
CDH 5.2 ships with Spark 1.1.0 and is modified so that Spark SQL can
talk to the Hive 0.13.1 that is also bundled with CDH, so if
Hi Marcelo,
Exactly. Found it a few minutes ago.
I ran the MySQL Hive 0.12 schema SQL on my Hive 0.10 metastore, which created
the missing tables, and it seems to be working now.
Not sure if everything else in CDH 4.6/Hive 10 would also still be working
though or not.
Looks like we cannot use Spark SQL in a clean
Hi there, I'm running spark on ec2, and am running into an error there that
I don't get locally. Here's the error:
11335 [handle-read-write-executor-3] ERROR
org.apache.spark.network.SendingConnection - Exception while reading
SendingConnection to ConnectionManagerId([IP HERE])
You are right. Creating the StreamingContext from the SparkContext instead of
SparkConf helped. Thanks for the help.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Sentiment-Analysis-of-Twitter-streams-tp16410p16520.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
My setup: tomcat (running a web app which initializes SparkContext) and
dedicated Spark cluster (1 master 2 workers, 1VM per each).
I am able to properly start this setup where SparkContext properly
initializes connection with master. I am able to execute tasks and perform
required
Hi All,
I figured out what the problem was. Thank you Sean for pointing me in the
right direction. All the jibber jabber about empty DStream / RDD was all
just pure nonsense. I guess the sequence of events (the fact that Spark
Streaming started crashing just after I implemented the
Hi,
As a result of a reduction operation, the resultant value score is a
DStream[Int]. How can I get the simple Int value?
I tried score[0] and score._1, but neither worked, and I can't find a
getValue() in the DStream API.
thanks
Hi,
I am evaluating Spark Streaming with Kafka, and I found that Spark Streaming
is slower than Spark. It took more time processing the same amount of data; as
per the Spark console it can process 2300 records per second.
Is my assumption correct? Spark Streaming has to do a lot of this
Hi Greg,
I'm not sure exactly what it is that you're trying to achieve, but I'm
pretty sure those variables are not supposed to be set by users. You
should take a look at the documentation for
spark.driver.extraClassPath and spark.driver.extraLibraryPath, and
the equivalent options for executors.
Hi,
I want to check the DEBUG log of the Spark executor on YARN (using yarn-cluster
mode), but
1. yarn daemonlog setlevel DEBUG YarnChild.class
2. set log4j.properties in spark/conf folder on client node.
Neither of the above works.
So how could I set the log level of the Spark executor on a YARN container
Hi Eric,
Check the Debugging Your Application section at:
http://spark.apache.org/docs/latest/running-on-yarn.html
Long story short: upload your log4j.properties using the --files
argument of spark-submit.
(Mental note: we could make the log level configurable via a system property...)
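A sketch of the --files approach just described; the paths, application class, and jar name here are hypothetical:

```shell
# --files ships the file into each YARN container's working directory,
# where log4j picks it up.
spark-submit \
  --master yarn-cluster \
  --files /local/path/to/log4j.properties \
  --class com.example.MyApp \
  my-app.jar
```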
On
Hi Xiangrui,
I am using yarn-cluster mode. The current hadoop cluster is configured to
only accept yarn-cluster mode and not allow yarn-client mode. I have no
privilege to change that.
Without initializing with k-means||, the job finished in 10 minutes. With
k-means, it just hangs there for
Hi -
Has anybody figured out how to integrate a Play application with Spark and run
it on a Spark cluster using spark-submit script? I have seen some blogs about
creating a simple Play app and running it locally on a dev machine with sbt run
command. However, those steps don't work for
Hi
Can anyone share a project as a sample? I tried a couple of days ago but
couldn't make it work. Looks like it's due to some Kafka dependency issue.
I'm using sbt-assembly.
Thanks
Gary
I have a Spark application which is running Spark Streaming and Spark
SQL.
I observed that the size of the shuffle files under the spark.local.dir folder
keeps increasing and never decreases. Eventually it will run into an
out-of-disk-space error.
The question is: when will Spark delete these shuffle files?
In the
Anybody with good hands-on experience with Spark, please do reply. It would
help us a lot!!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477p16536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I would like to reiterate that I don't have Hive installed on the Hadoop
cluster.
I have some queries on following comment from Cheng Lian-2:
The Thrift server is used to interact with existing Hive data, and thus
needs Hive Metastore to access Hive catalog. In your case, you need to build
Spark
I got tipped by an expert that the "Unsupported language features in query"
error I had was due to the fact that Spark SQL does not support
dynamic partitions, and that I can do saveAsParquetFile() for each partition.
My inefficient implementation is to:
//1. run the query without DISTRIBUTE BY