Problem with Execution plan using loop

2017-04-15 Thread Javier Rey
Hi guys, I have this situation: 1. A data frame with 22 columns. 2. I need to add some columns (feature engineering) using the existing columns; 12 columns will be added for each column in a list. 3. I created a loop, but at the 5th item (column) of the loop the join part becomes very slow. I can …

Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
What I missed: try increasing the number of partitions using repartition. On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote: > It does not look like a Scala vs. Python thing. How big is your audience data > store? Can it be broadcasted? > > What is the memory footprint you are …

Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
It does not look like a Scala vs. Python thing. How big is your audience data store? Can it be broadcasted? What is the memory footprint you are seeing? At what point is YARN killing it? Depending on that, you may want to tweak the number of partitions of the input dataset and increase the number of …

Join streams Apache Spark

2017-04-15 Thread tencas
Hi everybody, I am using Apache Spark Streaming with a TCP connector to receive data. I have a Python application that connects to a sensor and creates a TCP server that waits for a connection from Apache Spark, and then sends JSON data through this socket. How can I manage to join many independent …