If you can share an isolated example I'll take a look.  Not something I've
run into before.
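
In the meantime, one thing worth checking: Spark builds DataFrames lazily, so an uncached DataFrame can re-run a UDF once per action, and the optimizer makes no guarantee that a UDF is evaluated exactly once per row. Here is a plain-Python analogue of that lazy re-evaluation, just to illustrate the effect (this is not Spark itself; the names are made up for the sketch):

```python
# Plain-Python analogue of lazy re-evaluation: each "action" on a lazy
# pipeline re-runs the mapped function, much as each Spark action on an
# uncached DataFrame can re-run a UDF over the data.
calls = {"n": 0}

def my_udf(x):
    calls["n"] += 1  # count invocations, like a Spark accumulator would
    return x * 2

data = list(range(10))

def pipeline():
    # Rebuilt on every call, like an uncached lineage.
    return (my_udf(x) for x in data)

result = list(pipeline())            # "action" 1: 10 invocations
count = sum(1 for _ in pipeline())   # "action" 2: 10 more invocations

assert result == [x * 2 for x in data]
assert count == 10
assert calls["n"] == 20  # the function ran twice per record, not once
```

In Spark, persisting the DataFrame (df.cache() or df.persist()) before running multiple actions on it is the usual way to keep a UDF from being re-evaluated, and is worth trying here as a diagnostic.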

On Wed, Jan 20, 2016 at 3:53 PM, jpocalan <jpoca...@gmail.com> wrote:

> Hi,
>
> I have an application which creates a Kafka Direct Stream from 1 topic
> having 5 partitions.
> As a result each batch is composed of an RDD having 5 partitions.
> To apply transformations to my batch, I decided to convert the RDD to a
> DataFrame (DF) so that I can easily add columns to the initial DF using
> custom UDFs.
>
> However, when I apply any UDF to the DF, I notice that the UDF gets
> executed multiple times, and the multiplier is driven by the number of
> partitions.
> For example, imagine I have an RDD with 10 records and 5 partitions:
> ideally my UDF should get called 10 times, but it consistently gets called
> 50 times. The resulting DF is nonetheless correct, and count() properly
> returns 10, as expected.
>
> I have changed my code to work directly with RDDs using mapPartitions,
> and the transformation gets called the proper number of times.
>
> As additional information, I have set spark.speculation to false and no
> tasks failed.
>
> I am working on a smaller example that would isolate this potential issue,
> but in the meantime I would like to know whether anybody else has
> encountered it.
>
> Thank you.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Problem-with-DataFrame-UDFs-tp26024.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
