If you can share an isolated example I'll take a look. Not something I've run into before.
On Wed, Jan 20, 2016 at 3:53 PM, jpocalan <jpoca...@gmail.com> wrote:
> Hi,
>
> I have an application that creates a Kafka direct stream from one topic
> with 5 partitions. As a result, each batch consists of an RDD with 5
> partitions.
>
> To apply transformations to each batch, I decided to convert the RDD to a
> DataFrame (DF) so that I can easily add columns to the initial DF using
> custom UDFs.
>
> However, when I apply any UDF to the DF, I notice that the UDF gets
> executed multiple times, and the multiplier is driven by the number of
> partitions.
> For example, given an RDD with 10 records and 5 partitions, my UDF should
> ideally be called 10 times, yet it is consistently called 50 times. The
> resulting DF is correct, and count() properly returns 10, as expected.
>
> I changed my code to work directly with RDDs using mapPartitions, and the
> transformation is called the correct number of times.
>
> As additional information, I have set spark.speculation to false and no
> tasks failed.
>
> I am working on a smaller example that would isolate this potential issue,
> but in the meantime I would like to know whether anybody has encountered it.
>
> Thank you.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Problem-with-DataFrame-UDFs-tp26024.html