Hi,
I have a simple application which fails with the following exception
only when the application is restarted (i.e. the checkpointDir has
entries from a previous execution):
Exception in thread "main" org.apache.spark.SparkException:
org.apache.spark.streaming.dstream.ShuffledDStream@2264e43c
Hello everyone
I'm adopting the latest version of Apache Spark on my project, moving from
*1.2.x* to *1.3.x*, and the only significant incompatibility so far is
related to the *Row* class.
Any idea what happened to the *org.apache.spark.sql.api.java.Row* class
in Apache Spark 1.3?
Hello,
in order to compute a huge dataset, the number of columns allowed when
calculating the covariance matrix is limited:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L129
What is the reason behind this limitation and can it
No, creating a DF using createDataFrame doesn't work either:
val peopleDF = sqlContext.createDataFrame(people)
The code compiles but raises the same error as toDF at the line
above.
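For reference, a minimal Scala sketch of both ways of building a DataFrame in 1.3 (the Person case class, the data, and an existing SparkContext sc are hypothetical); a NoSuchMethodError at toDF() is often a Scala-version mismatch between the application jar and the Spark build rather than a problem with either API:

import org.apache.spark.sql.SQLContext

// Hypothetical schema; in an application keep the case class at the top level
// so the compiler-generated TypeTag is visible.
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // required for the toDF() implicit conversion

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))

val peopleDF1 = people.toDF()                        // implicit conversion
val peopleDF2 = sqlContext.createDataFrame(people)   // explicit equivalent
peopleDF1.printSchema()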
On Wed, May 13, 2015 at 6:22 PM Sebastian Alfers
Are you sure that you are submitting it correctly? Can you post the entire
command you are using to run the .jar file via spark-submit?
Ok, here it is:
/opt/spark-1.3.1-bin-hadoop2.6/bin/spark-submit
target/scala-2.11/webcat_2.11-1.0.jar
However, on the server somehow I have to specify main
Additionally, after I successfully packaged the code and submitted it
via spark-submit
webcat_2.11-1.0.jar, the following error was thrown at the line where toDF()
was called:
Exception in thread "main" java.lang.NoSuchMethodError:
From: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
I'm trying to use the direct approach to read messages from Kafka.
Kafka is running as a cluster and configured with Zookeeper.
On the above page it mentions:
In the Kafka parameters, you must specify either
Hi ,
I am using Spark SQL 1.3.1.
I have created a DataFrame using the JDBC data source and am using the saveAsTable()
method, but got the following 2 exceptions:
java.lang.RuntimeException: Unsupported datatype DecimalType()
at scala.sys.package$.error(package.scala:27)
at
Hi,
You could retroactively union an existing DStream with one from a
newly created file. Then when another file is detected, you would
need to re-union the stream and create another DStream. It seems like
the implementation of FileInputDStream only
I would suggest creating one DStream per directory and then using
StreamingContext#union(...) to get a union DStream.
-- Ankur
On 13/05/2015 00:53, hotdog wrote:
I want to use fileStream in spark streaming to monitor multiple
hdfs directories,
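A minimal sketch of that suggestion, with hypothetical directory paths and batch interval: one stream per directory, combined with StreamingContext#union.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))

// Hypothetical directories to monitor; one DStream each, then a single union.
val dirs = Seq("hdfs:///data/dir1", "hdfs:///data/dir2", "hdfs:///data/dir3")
val perDirStreams = dirs.map(dir => ssc.textFileStream(dir))
val allFiles = ssc.union(perDirStreams)

allFiles.count().print()
ssc.start()
ssc.awaitTermination()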
Hi,
I have a simple application which fails with the following exception
only when the application is restarted (i.e. the checkpointDir has
entries from a previous execution):
Exception in thread "main" org.apache.spark.SparkException:
Earthson wrote
Finally, I've found two ways:
1. search the output with something like "Submitted application
application_1416319392519_0115"
2. use a specific AppName. We could query the ApplicationID (YARN)
Hi Earthson,
Can you explain more about case 2? How can we query the ApplicationID by the
Hi,
I was running our graphx application (which worked fine on Spark 1.2.0) but it failed
on Spark 1.3.1 with the exception below.
Does anyone have an idea about this issue? I guess it was caused by using the LZ4 codec?
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to
stage failure: Task 54
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven
requirements are much lower, around 1.7GB. So try using maven.
For reference I have the following settings and both do compile. sbt would
not work with lower values.
$echo $SBT_OPTS
-Xmx3012m -XX:MaxPermSize=512m
Hi,
I have some applications that finished (but actually failed earlier), for which the WebUI
shows
Application myApp is still in progress.
and, in the eventlog folder, there are several log files like this:
app-20150512***.inprogress
So, I am wondering what “inprogress” means…
Thanks! :)
Best,
Hi,
Please help me with below two issues:
*Environment:*
I am running my spark cluster in standalone mode.
I am initializing the spark context from inside my tomcat server.
I am setting the properties below in environment.sh in the $SPARK_HOME/conf
directory
hi,
I use Spark and Flink in the same Maven project.
Now I get an exception when working with Spark; Flink works well.
The problem is transitive dependencies.
Maybe somebody knows a solution, or versions which work together.
best regards
paul
PS: a Cloudera Maven repo for Flink would be desirable
my
Many thanks Cody and contributors for the help.
On Wed, May 13, 2015 at 3:44 PM, Cody Koeninger c...@koeninger.org wrote:
Either one will work, there is no semantic difference.
The reason I designed the direct api to accept both of those keys is
because they were used to define lists of
Hi,
I want to get a JavaPairRDD<String, String> from the tuple part of a
JavaPairRDD<String, Tuple2<String, String>>.
As an example: (
http://www.koctas.com.tr/reyon/el-aletleri/7,(0,1,0,0,0,0,0,0,46551))
is in my JavaPairRDD<String, Tuple2<String, String>> and I want to get
( (46551),
I tried to build Spark on my local machine, Ubuntu 14.04 (4 GB RAM), and my
system hangs (freezes). When I monitored the system processes, the
build process was found to consume 85% of my memory. Why does it need so many
resources? Is there a more efficient method to build Spark?
Thanks
Akhil
You can run the following command:
mvn dependency:tree
And see what jetty versions are brought in.
Cheers
On May 13, 2015, at 6:07 AM, Pa Rö paul.roewer1...@googlemail.com wrote:
hi,
I use Spark and Flink in the same Maven project.
Now I get an exception when working with Spark; Flink
I'm trying the Kafka Direct approach (for consuming) but when I use only this
config:
kafkaParams.put("group.id", groupdid);
kafkaParams.put("zookeeper.connect", zookeeperHostAndPort + "/cb_kafka");
I get this
Exception in thread "main" org.apache.spark.SparkException: Must specify
metadata.broker.list or
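For what it's worth, a minimal sketch of the direct approach with the broker list supplied (broker addresses, topic name, and the existing StreamingContext ssc are hypothetical); the direct API talks to the Kafka brokers, so zookeeper.connect alone is not enough:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Brokers, not ZooKeeper: host:port pairs of the Kafka brokers themselves.
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("cb_topic")

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)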
My 2 cents: If you have Java 8, you don't need any extra settings for
Maven.
--
Emre Sevinç
On Wed, May 13, 2015 at 3:02 PM, Stephen Boesch java...@gmail.com wrote:
Hi Akhil, Building with sbt tends to need around 3.5GB whereas maven
requirements are much lower, around 1.7GB. So try using
Either one will work, there is no semantic difference.
The reason I designed the direct api to accept both of those keys is
because they were used to define lists of brokers in pre-existing Kafka
project apis. I don't know why the Kafka project chose to use 2 different
configuration keys.
On
In spark streaming, I want to use fileStream to monitor a directory. But the
files in that directory are compressed using lz4, so the new lz4 files are
not detected by the following code. How can I detect these new files?
val list_join_action_stream = ssc.fileStream[LongWritable, Text,
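A sketch of one thing to try, assuming the default path filter is what is skipping the compressed files: the three-argument fileStream overload takes an explicit filter, so you can accept the .lz4 names yourself (directory and names are hypothetical; reading lz4 still requires the codec on the Hadoop classpath).

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///data/join_action",
  (path: Path) => path.getName.endsWith(".lz4"),  // explicitly accept the new .lz4 files
  newFilesOnly = true)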
As far as I can tell, Dibyendu's cons boil down to:
1. Spark checkpoints can't be recovered if you upgrade code
2. Some Spark transformations involve a shuffle, which can repartition data
It's not accurate to imply that either one of those things is inherently
a con of the direct stream api.
I understand that this port value is randomly selected.
Is there a way to enforce which spark port a Worker should use?
Hi guys,
If I load a dataframe via a sql context that has a SORT BY in the query, and
I want to repartition the data frame, will it keep the sort order in each
partition?
I want to repartition because I'm going to run a Map that generates lots of
data internally, so to avoid Out Of Memory errors I
In my mind, this isn't really a producer vs consumer distinction, this is a
broker vs zookeeper distinction.
The producer apis talk to brokers. The low level consumer api (what direct
stream uses) also talks to brokers. The high level consumer api talks to
zookeeper, at least initially.
TLDR;
I believe what Dean Wampler was suggesting is to use the sqlContext not the
sparkContext (sc), which is where the createDataFrame function resides:
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.SQLContext
HTH.
-Todd
On Wed, May 13, 2015 at 6:00 AM, SLiZn Liu
Many thanks Cody!
On Wed, May 13, 2015 at 4:22 PM, Cody Koeninger c...@koeninger.org wrote:
In my mind, this isn't really a producer vs consumer distinction, this is
a broker vs zookeeper distinction.
The producer apis talk to brokers. The low level consumer api (what direct
stream uses)
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
shows setting up your stream and calling .checkpoint(checkpointDir) inside
the functionToCreateContext. It looks to me like you're setting up your
stream and calling checkpoint outside, after getOrCreate.
I'm not
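A minimal sketch of the pattern the guide describes, with hypothetical paths and intervals: everything, including the checkpoint() call, happens inside the factory function, and getOrCreate either recovers from the checkpoint or invokes the factory.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"

def functionToCreateContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("MyApp")
  val ssc = new StreamingContext(conf, Seconds(60))
  val lines = ssc.socketTextStream("some-host", 9999)  // set up all DStreams here
  lines.count().print()
  ssc.checkpoint(checkpointDir)                        // checkpoint inside the factory
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, functionToCreateContext _)
ssc.start()
ssc.awaitTermination()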
Hi,
I am using spark-shell and the steps using which I can reproduce the issue
are as follows:
scala> val dateDimDF =
sqlContext.load("jdbc", Map("url" -> "jdbc:teradata://192.168.145.58/DBS_PORT=1025,DATABASE=BENCHQADS,LOB_SUPPORT=OFF,USER=BENCHQADS,PASSWORD=abc",
"dbtable" -> "date_dim"))
scala>
Thanks Cody for your email. I think my concern was not about getting the ordering
of messages within a partition, which as you said is possible if one knows
how Spark works. The issue is how Spark schedules jobs on every batch, which
is not in the same order they were generated. So if that is not guaranteed it
Looking at Consumer Configs in
http://kafka.apache.org/documentation.html#consumerconfigs
The properties *metadata.broker.list* and *bootstrap.servers* are not
mentioned.
Do I need these on the consumer side?
On Wed, May 13, 2015 at 3:52 PM, James King jakwebin...@gmail.com wrote:
Many thanks
You linked to a google mail tab, not a public archive, so I don't know
exactly which conversation you're referring to.
As far as I know, streaming only runs a single job at a time in the order
they were defined, unless you turn on an experimental option for more
parallelism (TD or someone more
Is python installed on the machine where you ran ./spark-ec2 ?
Cheers
On Wed, May 13, 2015 at 1:33 PM, Su She suhsheka...@gmail.com wrote:
I'm trying to set up my own cluster and am having trouble running this
script:
./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2
I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside
dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context
seems no longer valid, which crashes subsequent jobs.
My spark version is 1.1.1. I will do more investigation into this issue,
perhaps after upgrading to
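A minimal sketch of the pattern being described, assuming an existing DStream dstream (the group id and 5-second timeout are hypothetical); in principle cancelJobGroup should only cancel jobs tagged with that group rather than invalidate the context, so a crash afterwards does sound like a bug.

dstream.foreachRDD { rdd =>
  val sc = rdd.sparkContext
  sc.setJobGroup("approx-count", "approximate count for this batch", interruptOnCancel = true)
  // PartialResult: take whatever estimate is available after 5 seconds.
  val estimate = rdd.countApprox(timeout = 5000).initialValue
  println(s"approximate count: ${estimate.mean}")
  sc.cancelJobGroup("approx-count")  // stop any straggler tasks still refining the estimate
}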
Hi Friends,
Can someone please give me an idea: ideally, what should the time (complete job
execution) be for a spark job?
I have data in a hive table; the amount of data is about 1GB, 2 lakh (200,000) rows for
the whole month.
I want to do monthly aggregation using SQL queries (group by).
I have only one node, 1
Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my classpath.
I’ve outlined the issue on Stack Overflow
(http://stackoverflow.com/questions/30221677/spark-sql-postgresql-data-source-issues).
I’m not sure how to fix this since I built the uber jar using sbt-assembly and
the
Hi,
I'm trying to run an application that uses a Hive context to perform some
queries over JSON files.
The code of the application is here:
https://github.com/GiovanniPaoloGibilisco/spark-log-processor/tree/fca93d95a227172baca58d51a4d799594a0429a1
I can run it on Spark 1.3.1 after rebuilding it
That is not supposed to happen :/ That is probably a bug.
If you have the log4j logs, it would be good to file a JIRA. This may be worth
debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo-inc.com wrote:
Actually I tried that before asking. However, it killed the spark context.
:-)
Du
I'm trying to set up my own cluster and am having trouble running this script:
./spark-ec2 --key-pair=xx --identity-file=xx.pem --region=us-west-2
--zone=us-west-2c --num-slaves=1 launch my-spark-cluster
based off: https://spark.apache.org/docs/latest/ec2-scripts.html
It just tries to open the
Hmm, I just tried to run it again, but opened the script with python;
the cmd window seemed to pop up really quickly and then exit.
On Wed, May 13, 2015 at 2:06 PM, Su She suhsheka...@gmail.com wrote:
Hi Ted, Yes I do have Python 3.5 installed. I just ran py from the
ec2 directory and it started up the
You might get stage information through SparkListener. But I am not sure
whether you can use that information to easily kill stages.
Though I highly recommend using Spark 1.3.1 (or even Spark master). Things
move really fast between releases. 1.1.1 feels really old to me :P
TD
On Wed, May 13,
I'm using streaming integrated with streaming-kafka.
My kafka topic has 80 partitions, while my machines have 40 cores. I found
that when the job is running, the kafka consumer processes are only deployed
to 2 machines, and the bandwidth of those 2 machines becomes very, very high.
I wonder is there
With this low-level Kafka API
https://github.com/dibbhatt/kafka-spark-consumer/, you can actually
specify how many receivers you want to spawn, and most of the time it
spawns them evenly; usually you can put a sleep just after creating the context
for the executors to connect to the driver and then
Your stack trace says it can't convert date to integer. You sure about
column positions?
On 13 May 2015 21:32, Ishwardeep Singh ishwardeep.si...@impetus.co.in
wrote:
Hi ,
I am using Spark SQL 1.3.1.
I have created a dataFrame using jdbc data source and am using
saveAsTable()
method but got
I believe most ports are configurable at this point, look at
http://spark.apache.org/docs/latest/configuration.html
search for .port
On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com wrote:
I understand that this port value is randomly selected.
Is there a way to enforce
Or you can use this Receiver as well:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
where you can specify how many Receivers you need for your topic, and it
will divide the partitions among the Receivers and return the joined stream
to you.
Say you specified 20 receivers, in
Is $"foo" or mydf("foo") or both checked at compile time to verify that
the column reference is valid? Thx.
Dean
On Wednesday, May 13, 2015, Michael Armbrust mich...@databricks.com wrote:
I would not say that either method is preferred (neither is
old/deprecated). One advantage to the second
Hello Spark gurus,
We have a spark streaming application that is consuming from a Flume stream
and has some window operations. The input batch sizes are 1 minute and
intermediate Window operations have window sizes of 1 minute, 1 hour and 6
hours. I enabled checkpointing and Write ahead log so
Hi all,
I want to use spark-streaming with flume. Now I am in trouble: I don't
know how to configure flume. I configure flume like this:
a1.sources = r1
a1.channels = c1 c2
a1.sources.r1.type = avro
a1.sources.r1.channels = c1 c2
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port =
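For reference, a minimal sketch of the Spark side of the push-based approach (host and port are hypothetical); with this approach Flume needs an avro sink pointing at the address Spark listens on:

import org.apache.spark.streaming.flume.FlumeUtils

// Push-based receiver: a Flume avro sink should be configured to send to this host:port.
val flumeStream = FlumeUtils.createStream(ssc, "spark-receiver-host", 41414)
flumeStream.count().print()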
You could use a map() operation, but the easiest way is probably to just
call the values() method on the JavaPairRDD<A,B> to get a JavaRDD<B>.
See this link:
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
Tristan
On 13 May 2015 at 23:12, Yasemin Kaya
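A minimal Scala sketch of the same idea with hypothetical data; in the Java API the analogous call is values() on the JavaPairRDD, as suggested above.

// Key -> (vector, id) pairs; .values drops the key and keeps only the Tuple2 part.
val pairs = sc.parallelize(Seq(
  ("http://www.koctas.com.tr/reyon/el-aletleri/7", ("0,1,0,0,0,0,0,0", "46551"))))
val tuplePart = pairs.values          // RDD[(String, String)]
tuplePart.collect().foreach(println)  // prints (0,1,0,0,0,0,0,0,46551)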
I look into the Environment in both modes.
yarn-client:
spark.jars
local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar,file:/home/xxx/my-app.jar
yarn-cluster:
spark.yarn.secondary.jars
local:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
I
Hadoop version: CDH 5.4.
We need to connect to HBase, thus need extra
/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar
dependency.
It works in yarn-client mode:
spark-submit --class xxx.xxx.MyApp --master yarn-client --num-executors 10
--executor-memory 10g --jars
Thank you Tristan. It is totally what I am looking for :)
2015-05-14 5:05 GMT+03:00 Tristan Blakers tris...@blackfrog.org:
You could use a map() operation, but the easiest way is probably to just
call the values() method on the JavaPairRDD<A,B> to get a JavaRDD<B>.
See this link:
Hi,
Using spark sql with HiveContext. Spark version is 1.3.1
When running Spark locally everything works fine. When running on a Spark
cluster I get a ClassNotFoundError for org.apache.hadoop.hive.shims.Hadoop23Shims.
This class belongs to hive-shims-0.23, and is a runtime dependency for
spark-hive:
Hi all,
Quote: "Inside a given Spark application (SparkContext instance), multiple
parallel jobs can run simultaneously if they were submitted from separate
threads."
How do I run multiple jobs in one SparkContext using separate threads in pyspark?
I found some examples in scala and java, but
I'm just getting started with Spark SQL and DataFrames in 1.3.0.
I notice that the Spark API shows a different syntax for referencing
columns in a dataframe than the Spark SQL Programming Guide.
For instance, the API docs for the select method show this:
df.select($"colA", $"colB")
Whereas the
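For reference, a minimal sketch of the two styles against a hypothetical DataFrame; as far as I know, neither column name is checked at Scala compile time, both are resolved when the query plan is analyzed.

import org.apache.spark.sql.SQLContext

case class Rec(colA: Int, colB: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // brings the $"..." column syntax into scope

val df = sqlContext.createDataFrame(sc.parallelize(Seq(Rec(1, "a"), Rec(2, "b"))))

df.select($"colA", $"colB").show()        // programming-guide style
df.select(df("colA"), df("colB")).show()  // API-doc style; same result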
Hi all,
I'm experimenting with Spark's Word2Vec implementation for a relatively
large (5B word, vocabulary size 4M) corpus. Has anybody had success
running it at this scale?
-Shilad
--
Shilad W. Sen
Associate Professor
Mathematics, Statistics, and Computer Science Dept.
Macalester College
I cherry-picked this commit into my local 1.2 branch. It fixed the problem
with setting spark.serializer, but I get a similar problem with
spark.closure.serializer
org.apache.spark.SparkException: Failed to register classes with Kryo
at
Sorry for missing that in the upgrade guide. As part of unifying the Java
and Scala interfaces, we got rid of the Java-specific Row. You are correct
in assuming that you want to use the Row in org.apache.spark.sql from both
Scala and Java now.
On Wed, May 13, 2015 at 2:48 AM, Emerson Castañeda
Hi TD,
Thanks a lot. rdd.countApprox(5000).initialValue worked! Now my streaming app
stands a much better chance of completing each batch within the
batch interval.
Du
On Tuesday, May 12, 2015 10:31 PM, Tathagata Das t...@databricks.com
wrote:
From the code it seems
Hi all,
I want to get a general idea about best practices around data format
serialization. Currently, I am using avro as the data serialization
format but the emitted types aren't very scala friendly. So, I was
wondering how others deal with this
Sander,
I eventually solved this problem via the --[no-]switch_user flag, which is set
to true by default. I set this to false, which would have the user that owns
the process run the job, otherwise it was my username (scarman)
running the job, which would fail because obviously my username
Hi Stephen,
You probably didn't run the Spark driver/shell as root; the Mesos scheduler
will pick up your local user, try to impersonate that user, and
chown the directory before executing any task.
If you try to run Spark driver as root it should resolve the problem. No
switch user
Yup, exactly as Tim mentioned on it too. I went back and tried what you just
suggested and that was also perfectly fine.
Steve
On May 13, 2015, at 1:58 PM, Tim Chen
t...@mesosphere.iomailto:t...@mesosphere.io wrote:
Hi Stephen,
You probably didn't run the Spark driver/shell as root, as Mesos
Hi all,
I'm experimenting with Spark's Word2Vec implementation for a relatively
large (5B word, vocabulary size 4M, 400-dimensional vectors) corpus. Has
anybody had success running it at this scale?
Thanks in advance for your guidance!
-Shilad
--
Shilad W. Sen
Associate Professor
Indeed, many thanks.
On Wednesday, 13 May 2015, Cody Koeninger c...@koeninger.org wrote:
I believe most ports are configurable at this point, look at
http://spark.apache.org/docs/latest/configuration.html
search for .port
On Wed, May 13, 2015 at 9:38 AM, James King jakwebin...@gmail.com
Hi TD,
Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout?
Otherwise it keeps running until completion, producing results not used but
consuming resources.
Thanks, Du
On Wednesday, May 13, 2015 10:33 AM, Du Li l...@yahoo-inc.com.INVALID
wrote:
Hi TD,
Easiest way is to broadcast it.
On 13 May 2015 10:40, Charles Hayden charles.hay...@atigeo.com wrote:
In pySpark, I am writing a map with a lambda that calls random.shuffle.
For testing, I want to be able to give it a seed, so that successive runs
will produce the same shuffle.
I am looking
Hi Hao: I tried broadcast join with the following steps, and found that my
query is still running slowly; I'm not sure if I'm using broadcast join correctly:
1. add spark.sql.autoBroadcastJoinThreshold 104857600 (100MB)
in conf/spark-defaults.conf. 100MB is larger than either of my 2 tables.
2. start
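For what it's worth, a minimal sketch of setting the threshold per-session instead of in spark-defaults.conf (table names are hypothetical); the broadcast only kicks in when Spark knows the table sizes, e.g. Hive tables with statistics:

// 100MB threshold: tables smaller than this (when their size is known) get broadcast.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

val joined = sqlContext.sql(
  """SELECT a.*, b.some_col
    |FROM big_table a
    |JOIN small_table b ON a.key = b.key""".stripMargin)

joined.explain()  // check the physical plan for a broadcast (hash) join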