2018-02-23 Thread Brindha Sengottaiyan

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the ALS algorithm in Spark. On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote: > have you looked at http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread vijay.bvp
Have you looked at http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html and the post mentioned there, https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html? Also try compressing the output
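On the compression point, the usual knobs are the standard Spark shuffle compression properties (the values shown are the documented defaults; verify them against your Spark version):

```properties
# Compress map output files (default: true)
spark.shuffle.compress          true
# Compress data spilled to disk during shuffles (default: true)
spark.shuffle.spill.compress    true
# Codec used for shuffle, spill, and broadcast compression
spark.io.compression.codec      lz4
```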

Re: sqoop import job not working when spark thrift server is running.

2018-02-23 Thread vijay.bvp
It surely is not able to get sufficient resources from YARN to start the containers. Is it only with this import job, or does any other job you submit also fail to start? As a test, just try to run another Spark job or a MapReduce job and see if the job can be started. Reduce the thrift server

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread vijay.bvp
Instead of spark-shell, have you tried running it as a job? How many executors and cores? Can you share the RDD graph and event timeline from the UI? Did you find which of the tasks were taking more time, and was there any GC? Please look at the UI if you haven't already; it can provide a lot of information

Re: Can spark handle this scenario?

2018-02-23 Thread vijay.bvp
When an HTTP connection is opened, you are opening a connection from one specific machine (with an IP and NIC card) to another specific machine, so it can't be serialized and used on another machine, right? This isn't a Spark limitation. I made a simple diagram if it helps. The objects created at the driver
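The point about connections not being serializable can be shown outside Spark as well. This hedged Python sketch demonstrates that a live socket (the thing underlying an HTTP connection) cannot be pickled, which is why such objects have to be created on each worker (e.g. once per partition) rather than at the driver:

```python
import pickle
import socket

# A network connection is bound to one machine's file descriptors,
# so serialization frameworks refuse to ship it to another machine.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    pickle.dumps(sock)
    serializable = True
except TypeError:
    # Raised as "cannot pickle 'socket' object" (wording varies by version)
    serializable = False
finally:
    sock.close()

print(serializable)  # False: create connections on the worker instead
```

Scala/Java serialization fails the same way at job submission, which is exactly the NotSerializableException people hit when a connection is captured in a closure.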

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-23 Thread vijay.bvp
Thanks for adding the RDD lineage graph. I could see 18 parallel tasks for the HDFS read; was it changed? What is the Spark job configuration, i.e. how many executors and how many cores per executor? I would say keep the partitioning a multiple of (no. of executors * cores) for all the RDDs. If you have 3 executors
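As a quick illustration of that rule of thumb (this is a hypothetical helper, not a Spark API), the partition count can be rounded up to the nearest multiple of executors times cores so every scheduling wave uses all slots:

```python
def suggested_partitions(num_executors: int, cores_per_executor: int,
                         desired_min: int) -> int:
    """Round desired_min up to the nearest multiple of total cores."""
    total_cores = num_executors * cores_per_executor
    waves = -(-desired_min // total_cores)  # ceiling division
    return waves * total_cores

# 3 executors with 4 cores each, wanting at least 18 partitions:
print(suggested_partitions(3, 4, 18))  # 24 (2 full waves of 12 tasks)
```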

Reservoir sampling in parallel

2018-02-23 Thread Patrick McCarthy
I have a large dataset composed of scores for several thousand segments, and the timestamps at which time those scores occurred. I'd like to apply some techniques like reservoir sampling[1], where for every segment I process records in order of their timestamps, generate a sample, and then at
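For reference, the single-pass reservoir sampling the poster cites is usually Vitter's Algorithm R; a minimal sketch looks like this, and keeping one reservoir per segment key is one straightforward way to parallelize it:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Item i survives with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5)
print(len(sample))  # 5
```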

Spark with Kudu behaving unexpectedly when bringing down the Kudu Service

2018-02-23 Thread ravidspark
Hi All, I am trying to read data from Kafka and ingest into Kudu using Spark Streaming. I am not using KuduContext to perform the upsert operation into Kudu. Instead I am using Kudu's native client API to build the PartialRow and applying the operation for every record from Kafka. I am able to run the

Re: HBase connector does not read ZK configuration from Spark session

2018-02-23 Thread Deepak Sharma
Hi Dharmin, with the 1st approach you will have to read the properties from the --files using the below: SparkFiles.get('file.txt'). Or else, you can copy the file to HDFS, read it using sc.textFile, and use the property within it. If you add files using --files, they get copied to the executor's
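A hedged sketch of the first approach: SparkFiles.get resolves the local path of a file shipped with --files, and the parsing itself is plain Python (parse_properties here is a hypothetical helper, not part of Spark):

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# On an executor (assuming the job was submitted with --files file.txt):
#   from pyspark import SparkFiles
#   path = SparkFiles.get("file.txt")
#   props = parse_properties(open(path).read())

print(parse_properties("spark.master=yarn\n# a comment\ntimeout = 30"))
```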

What's relationship between the TensorflowOnSpark core modules?

2018-02-23 Thread xiaobo
Hi, after reading the mnist example and the API of TensorflowOnSpark, I somehow got confused; here are some questions: 1. What's the relationship between the TFCluster/TFManager/TFNode and TFSparkNode modules? 2. The conversion guide says we should replace the main function with a main_fun, but the

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with Spark Structured Streaming (2.2.1) reading data from Kafka (0.11). I need to aggregate data ingested every minute, and I am using spark-shell at the moment. The message ingestion rate is approx 500k/second. During some trigger intervals (1 minute), especially when

Spark-Solr -- unresolved dependencies

2018-02-23 Thread Selvam Raman
Hi, Spark version - EMR 2.0.0. spark-shell --packages com.lucidworks.spark:spark-solr:3.0.1. When I tried the above command, I am getting the below error: :: UNRESOLVED DEPENDENCIES ::
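One thing commonly tried for unresolved --packages dependencies is pointing Ivy at an explicit resolver, since older EMR images may not reach the right repositories by default. A sketch (the repository URL and coordinates should be verified for your version):

```shell
# Explicitly add a resolver so Ivy can locate spark-solr and its
# transitive dependencies; URL shown is Maven Central as an example.
spark-shell \
  --packages com.lucidworks.spark:spark-solr:3.0.1 \
  --repositories https://repo1.maven.org/maven2
```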

What happens if I can't fit data into memory while doing stream-stream join.

2018-02-23 Thread kant kodali
Hi All, I am experimenting with the Spark 2.3.0 stream-stream join feature to see if I can leverage it to replace some of our existing services. Imagine I have 3 worker nodes, with *each node* having 16GB RAM and 100GB SSD. My input dataset, which is in Kafka, is about 250GB per day. Now I want to do

NotSerializableException with Trait

2018-02-23 Thread Jean Rossier
Hello, I have a few Spark jobs that are doing the same aggregations. I want to factorize the aggregation logic, and for that I want to use a Trait. When I run this job extending my Trait (on YARN, in client mode), I get a NotSerializableException (in attachment). If I change my Trait to an Object,
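The usual Scala-side fix is to have the Trait extend Serializable (or mark non-serializable members @transient). The failure mode itself can be shown with Python's pickle as a loose analogy (not the Scala mechanics): an instance whose class mixes in only plain state serializes fine, while one holding a non-serializable resource does not.

```python
import pickle
import threading

class AggregationMixin:
    """Analogue of a Trait carrying only plain, serializable state."""
    group_cols = ("segment", "day")

class JobWithLock(AggregationMixin):
    def __init__(self):
        # A lock stands in for any non-serializable member
        # (connection, logger, file handle, ...).
        self.lock = threading.Lock()

class JobPlain(AggregationMixin):
    pass

pickle.dumps(JobPlain())  # fine: nothing unserializable is captured
try:
    pickle.dumps(JobWithLock())
    failed = False
except TypeError:
    failed = True  # analogous to Spark's NotSerializableException
print(failed)  # True
```

In Spark closures the same thing happens implicitly: methods inherited from the Trait capture `this`, so the whole enclosing instance (including any non-serializable members) gets dragged into serialization.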