I create the connection object inside foreachPartition. How do I do this
in Structured Streaming? I tried the connection pooling approach (where I
create a pool of connections on the master node and pass it to the worker
nodes) here:
<https://stackoverflow.com/questions/50205650/spark-connection-
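For Structured Streaming specifically, one way to get a per-partition connection without pooling on the driver is the ForeachWriter sink (available in Spark 2.x). A minimal sketch follows; MyConnection and openConnection() are hypothetical stand-ins for whatever client you actually use:

import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical client, for illustration only.
class MyConnection extends Serializable {
  def send(s: String): Unit = println(s)
  def close(): Unit = ()
}
def openConnection(): MyConnection = new MyConnection

val writer = new ForeachWriter[Row] {
  var conn: MyConnection = _
  // open() runs once per partition on the executor, so the connection
  // is never created on (or shipped from) the driver.
  override def open(partitionId: Long, version: Long): Boolean = {
    conn = openConnection(); true
  }
  override def process(row: Row): Unit = conn.send(row.toString)
  override def close(errorOrNull: Throwable): Unit = conn.close()
}

// df being the streaming DataFrame read from Kafka:
// df.writeStream.foreach(writer).start()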
Re: How to properly execute `foreachPartition` in Spark 2.2
> Couldn't you readStream from Kafka, then
> writeStream to Kafka?
> *From: *Liana Napalkova
> *Date: *Monday, December 18, 2017 at 10:07 AM
> *To: *Silvio Fiorito, "user@spark.apache.org"
> *Subject: *Re: How to properly execute `foreachPartition` in Spark 2.2
If there is no other way, then I will follow this recommendation.
From: Silvio Fiorito
Sent: 18 December 2017 16:20:03
To: Liana Napalkova; user@spark.apache.org
Subject: Re: How to properly execute `foreachPartition` in Spark 2.2
Couldn’t you readStream from Kafka, then writeStream to Kafka?
Subject: Re: How to properly execute `foreachPartition` in Spark 2.2
I need to first read from a Kafka queue into a DataFrame. Then I should perform
some transformations on the data. Finally, for each row in the DataFrame I
should conditionally apply a KafkaProducer in order to send some data.
Spark Dataset / Dataframe has foreachPartition() as well. Its
implementation is much more efficient than RDD's.
There are tons of code snippets, e.g.
https://github.com/hdinsight/spark-streaming-data-persistence-examples/blob/master/src/main/scala/com/microsoft/spark/streaming/examples/c
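For a non-streaming Dataset/DataFrame, the per-partition connection pattern itself is short. A minimal sketch (the Conn class and openConnection() are hypothetical stand-ins; the explicit function type avoids the Java/Scala overload ambiguity on foreachPartition):

import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("sketch").getOrCreate()
val df = spark.read.csv("file:///C:/input_data/*.csv")

class Conn { def send(s: String): Unit = println(s); def close(): Unit = () }  // stand-in client
def openConnection(): Conn = new Conn

// One connection per partition, created on the executor.
val writePartition: Iterator[Row] => Unit = { rows =>
  val conn = openConnection()
  try rows.foreach(r => conn.send(r.toString))
  finally conn.close()
}
df.foreachPartition(writePartition)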
From: Silvio Fiorito
Sent: 18 December 2017 16:00:39
To: Liana Napalkova; user@spark.apache.org
Subject: Re: How to properly execute `foreachPartition` in Spark 2.2
Why don’t you just use the Kafka sink for Spark 2.2?
https://spark.apache.org/docs/2.2.0
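For reference, the built-in Kafka source and sink documented at that link are used roughly like this, given a SparkSession `spark`; broker address, topic names and checkpoint path are placeholders:

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "input-topic")
  .load()

// The Kafka sink expects key/value columns (string or binary).
val query = input
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
  .start()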
Subject: How to properly execute `foreachPartition` in Spark 2.2
Hi,
I wonder how to properly execute `foreachPartition` in Spark 2.2. Below I
explain the problem in detail. I appreciate any help.
In Spark 1.6 I was doing something similar to this:
DstreamFromKafka.foreachRDD(session => {
  session.foreachPartition { partitionOfRecords =>
    // open a connection, send the records, close the connection
  }
})
Ok, there are at least two ways to do it:
Dataset<Row> df = spark.read().csv("file:///C:/input_data/*.csv");
df.foreachPartition(new ForEachPartFunction());
df.toJavaRDD().foreachPartition(new Void_java_func());
where ForEachPartFunction and Void_java_func are defined below (the bodies here
are a minimal sketch):
class ForEachPartFunction implements ForeachPartitionFunction<Row> {
    public void call(Iterator<Row> rows) { rows.forEachRemaining(System.out::println); }
}
class Void_java_func implements VoidFunction<Iterator<Row>> {
    public void call(Iterator<Row> rows) { rows.forEachRemaining(System.out::println); }
}
What would be a Java equivalent of the Scala code below?
def void_function_in_scala(ipartition: Iterator[Row]): Unit = {
  var df_rows = ArrayBuffer[String]()
  for (irow <- ipartition) {
    df_rows += irow.toString
  }
}
val df = spark.read.csv("file:///C:/input_data/*.csv")
df.foreachPartition(void_function_in_scala _)
foreachPartition is an action, but it runs on each worker, which means you won't
see anything on the driver.
mapPartitions is a transformation, which is lazy and won't do anything until
an action is called.
Which one is better depends on the specific use case: to output something (like
a print on a single machine) use foreachPartition; to produce a new RDD from
each partition, use mapPartitions.
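A small sketch of the contrast (values and names are illustrative):

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Transformation: lazy, returns a new RDD; nothing runs until an action.
val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))

// Action: triggers the job; println output lands in executor stdout,
// not on the driver console.
doubled.foreachPartition(iter => iter.foreach(println))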
Just wanted to clarify!!!
Is foreachPartition in Spark an output operation?
Which one is better to use: mapPartitions or foreachPartition?
Regards
Diwakar
I have the following sequence of Spark Java API calls (Spark 2.0.2):
1. A Kafka stream that is processed via a map function, which returns the
string value from tuple2._2() for a JavaDStream, as in
return tuple2._2();
2. The returned JavaDStream is then processed by foreachPartition, which is
wrapped by foreachRDD.
3. foreachPartition's call function iterates over the RDD, as in
inputRDD.next();
partKeyFileRDD = keyFileRDD.repartition(16)
Looking again at the UI, this file has 16 partitions now (all on the same
executor). When the forEachPartition runs, it then uses these 16 partitions
(all on the same executor). I think this is really the problem. I'm not sure
why the repartition didn't spread the partitions across both executors.
= partKeyFileRDD.mapToPair(new SimpleStorageServiceAsset());
The worker then has the following. The issue, I believe, is that the following
log.info statements only appear in the log file for one of my executors (and
not both). In other words, when executing the forEachPartition above, Spark
appears to think all of the partitions belong on the same executor.
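One way to see where partitions actually run (a hedged sketch, not from the original thread, written for the Scala API) is to record the executor host per partition:

// For each partition, record its index, the host it ran on, and its size.
val placement = partKeyFileRDD.mapPartitionsWithIndex { (idx, iter) =>
  val host = java.net.InetAddress.getLocalHost.getHostName
  Iterator((idx, host, iter.size))
}.collect()

placement.foreach { case (idx, host, n) =>
  println(s"partition $idx -> $host ($n records)")
}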
Hi,
Could you share the code with foreachPartition?
Jacek
On 11.03.2016 7:33 PM, "Darin McBeath" wrote:
I can verify this by looking at the log file for the workers.
Since I output logging statements in the object called by the foreachPartition,
I can see the statements being logged. Oddly, these output statements only
occur in one executor (and not the other). It occurs 16 times in this one
executor's log, once for each partition.
Hi,
How do you check which executor is used? Can you include a screenshot of
the master's webUI with workers?
Jacek
On 11.03.2016 6:57 PM, "Darin McBeath" wrote:
I've run into a situation where it would appear that foreachPartition is only
running on one of my executors.
I have a small cluster (2 executors with 8 cores each).
When I run a job with a small file (with 16 partitions) I can see that the 16
partitions are initialized, but they all appear on the same executor.
I think you can use mapPartitions to return the PairRDD, followed by
foreachPartition for saving it.
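A hedged sketch of that suggestion (inputRDD, keyOf and saveToStore are hypothetical placeholders):

// mapPartitions is lazy and returns the pair RDD; foreachPartition is the
// action that persists it. cache() avoids recomputing for both uses.
val pairRDD = inputRDD.mapPartitions { iter =>
  iter.map(rec => (keyOf(rec), rec))       // keyOf: hypothetical key extractor
}
pairRDD.cache()
pairRDD.foreachPartition { iter =>
  iter.foreach { case (k, v) => saveToStore(k, v) }   // hypothetical save
}
// pairRDD is still available afterwards as the returned key/value RDD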
On Wed, Nov 18, 2015 at 9:31 AM, swetha kasireddy wrote:
Looks like I can use mapPartitions but can it be done using
forEachPartition?
On Tue, Nov 17, 2015 at 11:51 PM, swetha wrote:
Hi,
How do I return an RDD of key/value pairs from an RDD that has
foreachPartition applied? I have my code something like the following. It
looks like an RDD that has foreachPartition applied can only have Unit as the
return type. How do I apply foreachPartition, do a save, and at the same time
return a pair RDD?
Ordering would be on a per-partition basis, not global ordering.
You typically want to acquire resources inside the foreachPartition
closure, just before handling the iterator.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
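The pattern from that design-patterns section looks roughly like this (createNewConnection is the docs' placeholder factory):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // The connection is created inside the partition closure, so it is
    // instantiated on the executor rather than serialized from the driver.
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}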
On Mon, Nov
Hi,
I wanted to understand the forEachPartition logic. In the code below, I am
assuming the iterator is executing in a distributed fashion.
1. Assuming I have a stream whose timestamp data is sorted: will the string
iterator in foreachPartition process each line in order?
2. Assuming I have
The closure is sent to and executed on an Executor, so you need to be looking
at the stdout of the Executors, not of the Driver.
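A hedged illustration of the difference:

val a = sc.parallelize(List(1, 2, 3))

// Runs on the executors; the lines appear in each executor's stdout log.
a.foreachPartition(iter => iter.foreach(println))

// Brings the data back first, so the lines appear on the driver console.
a.collect().foreach(println)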
On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> I'm just trying to do some operation inside foreachPartition
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.
scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at <console>:21
scala> a.foreachPartit
...which is holding us from going live.
We have a job that carries out computation on log files and writes the
results into an Oracle DB.
The reducer 'reduceByKey' has been set to parallelize by 4, as we don't want
to establish too many DB connections.
We are then calling foreachPartition on the RDD pairs that were reduced by
the key. Within this foreachPartition method we establish a DB connection,
then iterate the results, prepare the Oracle statement for batch insertion,
then commit the batch and close the connection.
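A hedged sketch of that flow, assuming the reduced RDD is (String, Long) pairs; the JDBC URL, credentials, table and columns are placeholders:

import java.sql.DriverManager

// Cap the number of simultaneous DB connections via the partition count.
val reduced = pairs.reduceByKey(_ + _, numPartitions = 4)

reduced.foreachPartition { iter =>
  val conn = DriverManager.getConnection(
    "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
  try {
    iter.foreach { case (k, v) =>
      stmt.setString(1, k)
      stmt.setLong(2, v)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}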
Cc: dgoldenberg; user
Subject: Re: Objects serialized before foreachRDD/foreachPartition ?
Considering the memory footprint of param as mentioned by Dmitry, option b
seems better.
Cheers
On Wed, Jun 3, 2015 at 6:27 AM, Evo Eftimov wrote:
Hmmm, a Spark Streaming app's code doesn't execute i
Sent: Wednesday, June 3, 2015 1:56 PM
To: user@spark.apache.org
Subject: Objects serialized before foreachRDD/foreachPartition ?
I'm looking at https://spark.apache.org/docs/latest/tuning.html. Basically
the takeaway is that all objects passed into the code processing RDD's must
be serializable.
Since all objects are serialized before being passed into the code processing
the RDD's, I'd need to think twice about the costs of serializing such
objects, it would seem.
In the below, does the Spark serialization happen before calling foreachRDD
or before calling foreachPartition?
Param param = new Param();
param.initialize();
messageBodies
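One hedged way to sidestep the question entirely is to construct the expensive object inside the partition closure, so it is never serialized from the driver at all (process() here is a hypothetical per-message handler):

messageBodies.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // Built on the executor, so Param never crosses the wire.
    val param = new Param()
    param.initialize()
    iter.foreach(msg => process(param, msg))
  }
}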
True.
On Mon, Apr 20, 2015 at 4:14 PM, Arun Patel wrote:
> mapPartitions is a transformation and foreachPartition is an action?
mapPartitions is a transformation and foreachPartition is an action?
Thanks
Arun
On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur wrote:
The same difference as between map and foreach: mapPartitions takes an
iterator and returns an iterator; foreachPartition takes an iterator and
returns Unit.
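The RDD signatures (simplified) make the contrast concrete:

// def mapPartitions[U](f: Iterator[T] => Iterator[U]): RDD[U]  // transformation
// def foreachPartition(f: Iterator[T] => Unit): Unit           // action

val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
val upper: org.apache.spark.rdd.RDD[String] =
  rdd.mapPartitions(iter => iter.map(_.toUpperCase))  // returns a new RDD
rdd.foreachPartition(iter => iter.foreach(println))   // returns Unit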
On Mon, Apr 20, 2015 at 4:05 PM, Arun Patel wrote:
What is the difference between mapPartitions and foreachPartition?
When to use these?
Thanks,
Arun
Unfortunately this doesn't work, as I can't seem to be able to access the
SparkContext from anywhere within foreachPartition(). The code above throws
a NullPointerException, but if I change it to ssc.sc.makeRDD (where ssc is
the StreamingContext object created in the main function, outside of
foreachPartition) then I get a NotSerializableException.
What is the
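A hedged sketch of why, and of the usual restructuring: the SparkContext exists only on the driver, so anything that needs it must run outside the partition closure, with only plain serializable data captured inside (referenceData, its fields and handle() are hypothetical):

dstream.foreachRDD { rdd =>
  // Driver side: SparkContext / StreamingContext are usable here.
  val lookup: Map[String, Int] = ssc.sparkContext
    .makeRDD(referenceData)          // referenceData: a hypothetical Seq
    .map(x => (x.key, x.value))      // hypothetical fields
    .collectAsMap()
    .toMap

  rdd.foreachPartition { iter =>
    // Executor side: only the plain `lookup` map is captured, not sc/ssc.
    iter.foreach(record => handle(record, lookup))   // hypothetical handler
  }
}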
Hi again,
On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer wrote:
>
> tmpRdd.foreachPartition(iter => {
> iter.map(item => {
> println("xyz: " + item)
> })
> })
>
Uh, with iter.foreach(...) it works... the reason being apparently that
iter.map() returns a lazy iterator, which is never consumed, so nothing runs.
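In other words (a minimal illustration):

tmpRdd.foreachPartition { iter =>
  iter.map(item => println("xyz: " + item))      // lazy iterator, never consumed: prints nothing
}
tmpRdd.foreachPartition { iter =>
  iter.foreach(item => println("xyz: " + item))  // consumes the iterator: the side effect runs
}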
In the output, I see only the "abc" prints (i.e. from the foreach() call).
(The result is the same also if I exchange the order.) What exactly is the
meaning of foreachPartition, and how would I use it correctly?
Thanks
Tobias
> countersPublishers.foreachRDD(rdd => rdd.collect().foreach(r =>
>   dbActorUpdater ! updateDBMessage(r)))
>
> There is no problem. I think something is misconfigured.
> Thanks for help
On Tue, Oct 14, 2014 at 12:42 PM, Sean McNamara wrote:
> Are you using spark streaming?
No, not at this time.
Are you using spark streaming?
On Oct 14, 2014, at 10:35 AM, Salman Haq wrote:
Hi,
In my application, I am successfully using foreachPartition to write large
amounts of data into a Cassandra database.
What is the recommended practice if the application wants to know that the
tasks have completed for all partitions?
Thanks,
Salman
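One relevant property (a general note, not from the original thread): foreachPartition is a blocking action, so the driver call returns only after every partition's task has finished, or the job has failed. writeToCassandra below is a hypothetical per-partition writer:

rdd.foreachPartition(iter => writeToCassandra(iter))

// Reaching this line means all partitions completed successfully;
// a task failure would instead surface above as a SparkException.
println("all partitions written")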
Hi,
I finally found a solution after reading the post :
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983
Hi,
I want to write my RDDs to multiple files based on a key value. So, I used
groupByKey and iterate over partitions. Here is the code:
rdd.map(f => (f.substring(0, 4), f)).groupByKey().foreachPartition { iterator =>
  iterator.foreach { case (key, values) =>
    val fs: FileSystem = FileSystem.get(new Configuration())
    // ... write `values` to a path derived from `key` ...
  }
}
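For reference, an approach along the lines the linked thread discusses is saveAsHadoopFile with a MultipleTextOutputFormat that derives the file name from the key. A hedged sketch, assuming plain text output and string keys/values:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
  // Drop the key from the written line; keep only the value.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

rdd.map(f => (f.substring(0, 4), f))
   .saveAsHadoopFile("/output/path", classOf[String], classOf[String],
     classOf[KeyBasedOutput])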