You can also look at the shuffle file cleanup tricks we do inside of the
ALS algorithm in Spark.
On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote:
> have you looked at
have you looked at
and the post mentioned there
also try compressing the output
it sure is not able to get sufficient resources from YARN to start the
Is it only this import job, or does any other job you submit fail as well?
As a test, just try to run another Spark job or a MapReduce job and see if
the job can be started.
Reduce the thrift server
Instead of spark-shell, have you tried running it as a job?
How many executors and cores? Can you share the RDD graph and event timeline
on the UI and did you find which of the tasks taking more time was they are
Please look at the UI if you haven't already; it can provide a lot of information
When an HTTP connection is opened, you are opening a connection from one specific
machine (with its own IP and NIC) to another specific machine, so it can't be
serialized and used on another machine, right?
This isn't a Spark limitation.
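To illustrate the point outside Spark, here is a minimal sketch in plain Python. The helper name `can_pickle` and the config values are made up for the example. It only shows that an open socket refuses to be serialized while a plain settings dict goes through fine, which is why the usual advice is to ship the settings and open the connection on the machine that uses it.

```python
import pickle
import socket


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# A socket is bound to OS-level state on this machine, so Python
# refuses to pickle it. No network traffic happens here; the socket
# is only created and closed.
sock = socket.socket()
try:
    socket_ok = can_pickle(sock)
finally:
    sock.close()

# Plain connection settings (hypothetical values) are just data and
# serialize without trouble.
config_ok = can_pickle({"host": "example.invalid", "port": 80})

print("socket picklable:", socket_ok)
print("config picklable:", config_ok)
```

The same idea carries over to a cluster: send the settings, then create the connection where the work actually runs.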
I made a simple diagram if it helps. The Objects created at driver
Thanks for adding the RDD lineage graph.
I could see 18 parallel tasks for the HDFS read; was that changed?
What is the Spark job configuration? How many executors and cores per
I would say keep the number of partitions a multiple of (number of executors * cores) for
all the RDDs.
if you have 3 executors
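A rough sketch of that rule of thumb, in plain Python. The function name `partitions_for` and the figure of 4 cores per executor are assumptions for illustration, not values from the thread. It rounds a desired partition count up to the next multiple of the total task slots.

```python
import math


def partitions_for(num_executors, cores_per_executor, min_partitions):
    """Smallest multiple of (executors * cores) that is >= min_partitions."""
    if num_executors < 1 or cores_per_executor < 1:
        raise ValueError("executors and cores must be positive")
    slots = num_executors * cores_per_executor
    # At least one full wave of tasks, so every core has work to do.
    waves = max(1, math.ceil(min_partitions / slots))
    return waves * slots


# Hypothetical cluster: 3 executors with 4 cores each gives 12 slots.
print(partitions_for(3, 4, 18))
print(partitions_for(3, 4, 12))
```

With a partition count that is a multiple of the slot count, each wave of tasks can keep every core busy instead of leaving some idle on the last wave.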
I have a large dataset composed of scores for several thousand segments,
and the timestamps at which those scores occurred. I'd like to apply
some techniques like reservoir sampling, where for every segment I
process records in order of their timestamps, generate a sample, and then
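For reference, a minimal single-machine sketch of per-segment reservoir sampling (the classic Algorithm R) in plain Python. The record layout `(segment, timestamp, score)` and the function names are assumptions for illustration; this is not the poster's code and it does not address how to distribute the work in Spark.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Algorithm R: uniform sample of up to k items from one pass over items."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is uniform over 0..i.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_segment(records, k, seed=0):
    """Group records by segment, order each group by timestamp, then sample."""
    by_segment = defaultdict(list)
    for segment, timestamp, score in records:
        by_segment[segment].append((timestamp, score))
    rng = random.Random(seed)
    samples = {}
    for segment in sorted(by_segment):
        ordered = sorted(by_segment[segment])  # sorts by timestamp first
        samples[segment] = reservoir_sample(ordered, k, rng)
    return samples


# Made-up data: segment "a" has 10 records, segment "b" has only 3.
records = [("a", t, t * 0.5) for t in range(10)]
records += [("b", t, 1.0) for t in range(3)]
samples = sample_per_segment(records, k=5)
print({segment: len(rows) for segment, rows in samples.items()})
```

A segment with fewer than `k` records simply keeps all of them, since the reservoir never fills up.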
I am trying to read data from Kafka and ingest into Kudu using Spark
Streaming. I am not using KuduContext to perform the upsert operation into
Kudu. Instead, I am using Kudu's native client API to build the PartialRow and
apply the operation for every record from Kafka. I am able to run the
With the first approach, you will have to read the properties from the
--files using the following:
Otherwise, you can copy the file to HDFS, read it using sc.textFile, and use
the property within it.
If you add files using --files, they get copied to the executor's
After reading the MNIST example and the API of TensorFlowOnSpark, I got
confused. Here are some questions:
1. What's the relationship between TFCluster/TFManager/TFNode and TFSparkNode
2. The conversion guide says we should replace the main function with a
main_fun, but the
I am working with Spark Structured Streaming (2.2.1), reading data from Kafka.
I need to aggregate data ingested every minute, and I am using spark-shell at
the moment. The message ingestion rate is approx. 500k/second. During
some trigger intervals (1 minute) especially when
spark version - EMR 2.0.0
spark-shell --packages com.lucidworks.spark:spark-solr:3.0.1
When I tried the above command, I got the error below:
:: UNRESOLVED DEPENDENCIES ::
I am experimenting with the Spark 2.3.0 stream-stream join feature to see if I
can leverage it to replace some of our existing services.
Imagine I have 3 worker nodes with *each node* having (16GB RAM and 100GB
SSD). My input dataset which is in Kafka is about 250GB per day. Now I want
I have a few Spark jobs that are doing the same aggregations. I want to
factor out the aggregation logic. For that I want to use a trait.
When I run this job extending my trait (over YARN, in client mode), I get
a NotSerializableException (in attachment).
If I change my trait to an object,