Hi,
I'm using pyspark (1.6.2) to do a little bit of ETL and have noticed a very
odd situation. I have two dataframes, base and updated. The "updated"
dataframe contains a constrained subset of data from "base" that I wish to
exclude. Something like this:
updated = base.where(base.X == F.lit(1000))
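For reference, a minimal sketch of that exclusion in PySpark 1.6; the column
name X, the literal 1000, and the input path are placeholders taken from the
snippet above:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pyspark.sql.functions as F

    sc = SparkContext(appName="exclude-subset")
    sqlContext = SQLContext(sc)
    base = sqlContext.read.parquet("/data/base")  # hypothetical input path

    # The constrained subset to exclude.
    updated = base.where(base.X == F.lit(1000))

    # Everything in base that is NOT in the subset: either negate the
    # predicate directly, or subtract the subset row-wise.
    remaining = base.where(base.X != F.lit(1000))
    remaining_alt = base.subtract(updated)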
Hi,
I have this simple Scala app which works fine when I run it as a Scala
application from the Scala IDE for Eclipse.
But when I export it as a jar and run it from spark-submit I get the error
below. Please suggest.
bin/spark-submit --class com.x.y.vr.spark.first.SimpleApp test.jar
16/09/24 23:1
Hello, everybody!
Maybe it's not the cause of your problem, but I've noticed this line in your
comments:
java version "1.8.0_51"
It is strongly advised to use Java 1.8.0_66 or later; I myself use
Java 1.8.0_101.
On Tue, Sep 20, 2016 at 1:09 AM, janardhan shetty wrote:
> Yes Sujit I have tried that op
As Cody said, Spark is not going to help you here.
There are two issues you need to look at here: the same message being
processed by two (or more) different processes, and the failure of any
component (including the message broker). Keep in mind that duplicated
messages can even occur
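As an illustration only: within a single batch you can drop duplicates by a
message key in Spark itself, but duplicates arriving later or in another
process are not covered, which is exactly why Spark alone does not solve the
problem. A sketch, assuming a hypothetical msg_id column:

    # df: the incoming batch as a DataFrame (placeholder).
    # Removes duplicates WITHIN this one batch only; cross-batch and
    # cross-process dedup needs an external idempotency store.
    deduped = df.dropDuplicates(["msg_id"])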
I am trying to prototype using a single SQLContext instance and use it to
append DataFrames, partitioned by a field, to the same HDFS folder from
multiple threads. (Each thread works with a DataFrame having a different
partition-column value.)
I get the exception: 16/09/24 16:45:12 ERROR [ForkJoinP
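For context, the write pattern being prototyped presumably looks something
like the sketch below; the column and path names are placeholders. Each
thread appends rows for one value of the partition column into the same root
folder:

    # df: the thread-local DataFrame (placeholder). Each thread runs this
    # with a different value in the "day" column; Spark lays the output
    # out as day=<value> subdirectories under the root path.
    (df.write
       .partitionBy("day")
       .mode("append")
       .parquet("hdfs:///warehouse/events"))

One common cause of exceptions in this pattern is that concurrent jobs
writing to the same root path share the _temporary staging directory used by
the output committer, so parallel appends can race with each other.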
Spark alone isn't going to solve this problem, because you have no reliable
way of making sure a given worker has a consistent shard of the messages
seen so far, especially if there's an arbitrary amount of delay between
duplicate messages. You need a DHT or something equivalent.
On Sep 24, 2016
We have too many (large) files. We have about 30k partitions with about 4
years' worth of data, and we need to process the entire history in a one-time
monolithic job.
I would like to know how Spark decides the number of executors requested.
I've seen test cases where the max executor count is Integer's MAX_VALUE.
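For reference, under dynamic allocation the number of executors requested
grows with the number of pending tasks and is bounded by configuration. A
minimal sketch; the values are illustrative, not recommendations:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.dynamicAllocation.enabled", "true")
            .set("spark.shuffle.service.enabled", "true")  # required for dynamic allocation
            .set("spark.dynamicAllocation.minExecutors", "10")
            .set("spark.dynamicAllocation.maxExecutors", "200"))
    sc = SparkContext(conf=conf)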
Hi Dan,
If you use Spark <= 1.6, you can also run
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
to quickly link the spark-csv jars into the Spark shell. Otherwise, as Holden
suggested, declare it in your Maven/sbt dependencies. The Spark guys assume
that their users have a good
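Once the package is linked, reading a CSV in a 1.x PySpark shell looks
roughly like this; the path and options are examples:

    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("/data/example.csv"))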
Do you have too many small files that you are trying to read? The number of
executors is very high.
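If the input really is many small files, one common mitigation (a sketch; the
path and partition count are illustrative) is to collapse the read into
fewer, larger partitions before processing:

    df = sqlContext.read.parquet("hdfs:///data/small-files")
    # coalesce() narrows to fewer partitions without a full shuffle,
    # unlike repartition(); 200 is a placeholder to tune.
    df = df.coalesce(200)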
On 24 Sep 2016 10:28, "Yash Sharma" wrote:
> Have been playing around with configs to crack this. Adding them here
> where it would be helpful to others :)
> Number of executors and timeout seemed like the