Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I think I now understand what the problem is; it is partly to do with
partitions and partly to do with memory.

I now think that the database write was not the source of the problem (the
problem being end-to-end performance).

The application reads rows from a database, does some simple derivations
then a group by and sum and joins the results back onto the original
dataframe and finally writes back to a database (at the same granularity as
the read).

I would run for 100,000 rows locally and end-to-end performance was
reasonable; if I went up to 1,000,000 rows, performance was far worse than
the increase in volume would suggest.  Profiling seemed to indicate that
the read and write were still happening efficiently in both cases but that
nothing much was happening in between.  Looking at the profiler during this
period indicated that most of the time was being spent in
Platform.copyMemory(...).  As a result, I increased the heap and driver
memory and the performance improved dramatically.
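
For reference, the change was simply launching with more driver memory (8g is
just the value I happened to try, not a tuned figure; the class and jar names
are placeholders):

    spark-submit --master "local[*]" --driver-memory 8g --class com.example.MyApp myapp.jar

In local mode the driver JVM hosts everything, so setting spark.driver.memory
on the SparkConf inside the application is too late; it has to be given at
launch time (or in spark-defaults.conf).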

It turns out that the groupBy key results in a bad distribution where 90%
of the data ends up in one partition, and at the point of writing out I'm
guessing that is when Spark decides it has to copy all of that data onto
one partition/worker.  Increasing the memory available appears to give it
the ability to do that.  The app has several cache() calls in it to work
around a codegen issue, which probably puts more pressure on memory usage
and compounds the problem.
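
For anyone hitting the same thing, this is roughly how the skew can be seen,
and repartitioning before the write is one possible mitigation (I haven't
benchmarked it properly; "group_key", the partition count and the table name
below are placeholders, not the real values):

    import java.util.Properties
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.desc

    // `joined` is the dataframe after the group-by/sum has been joined back on.
    def inspectAndWrite(joined: DataFrame, url: String, props: Properties): Unit = {
      // Rows per key, largest first - a quick way to see how badly the key is skewed.
      joined.groupBy("group_key").count().orderBy(desc("count")).show(10)

      // Spread the rows back out before the JDBC write so that one
      // partition/worker isn't holding nearly all of the data.
      joined.repartition(200).write.jdbc(url, "target_table", props)
    }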

Still, I have learnt that Spark is excellent in its ability to be lazy and
optimize across the whole end-to-end process from read to write.

On 21 April 2016 at 13:46, Michael Segel <msegel_had...@hotmail.com> wrote:

> How many partitions are in your data set?
>
> Per the Spark DataFrameWriter Java Doc:
> “
>
> Saves the content of the DataFrame
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html>
> to an external database table via JDBC. In the case the table already exists
> in the external database, behavior of this function depends on the save
> mode, specified by the mode function (default to throwing an exception).
>
> Don't create too many partitions in parallel on a large cluster; otherwise
> Spark might crash your external database systems.
>
> “
>
> This implies one connection per partition writing in parallel. So you
> could be swamping your database.
> Which database are you using?
>
> Also, how many hops?
> Network latency could impact performance too…
>
> On Apr 19, 2016, at 3:14 PM, Jonathan Gray <jonny.g...@gmail.com> wrote:
>
> Hi,
>
> I'm trying to write ~60 million rows from a DataFrame to a database using
> JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
>
> The write seems to not be performing well.  Profiling the application with
> a master of local[*] it appears there is not much socket write activity and
> also not much CPU.
>
> I would expect there to be an almost continuous block of socket write
> activity showing up somewhere in the profile.
>
> I can see that the top hot method involves
> apache.spark.unsafe.platform.CopyMemory all from calls within
> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
> stressed so I'm guessing this isn't the cause of the problem.
>
> Are there any best practices, or has anyone come across a case like this
> before where a write to a database seems to perform poorly?
>
> Thanks,
> Jon
>
>
>
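
To make Michael's point concrete for anyone following the thread: each
partition writes over its own JDBC connection, so capping the partition count
caps the number of concurrent connections hitting the database.  A rough,
untuned sketch (16 and the table name are arbitrary placeholders):

    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    def writeWithCappedConnections(df: DataFrame, url: String, props: Properties): Unit = {
      // coalesce avoids a full shuffle and limits the write to at most 16
      // partitions, i.e. at most 16 simultaneous connections.
      df.coalesce(16).write.jdbc(url, "target_table", props)
    }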


Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Jonathan Gray
I tried increasing the batch size (1000 to 10,000 to 100,000), but it didn't
appear to make any appreciable difference in my test case.

In addition, I had read in the Oracle JDBC documentation that batch sizes
should be set between 10 and 100 and that anything outside that range is not
advisable.  However, I don't have any evidence to prove or disprove that.
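
For reference, this is roughly how I passed the setting in, assuming 1.6
reads it from the "batchsize" key of the connection properties (credentials
and table name below are placeholders):

    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    def writeWithBatchSize(df: DataFrame, url: String): Unit = {
      val props = new Properties()
      props.setProperty("user", "app_user")          // placeholder
      props.setProperty("password", "app_password")  // placeholder
      props.setProperty("batchsize", "10000")        // also tried 1000 and 100000
      df.write.jdbc(url, "target_table", props)
    }
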
On 21 Apr 2016 6:16 am, "Takeshi Yamamuro" <linguin@gmail.com> wrote:

> Sorry for sending the previous message mid-way by mistake.
> How about trying to increase the `batchsize` JDBC option to improve
> performance?
>
> // maropu
>
> On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> Hi,
>>
>> How about trying to increase 'batchsize
>>
>> On Wed, Apr 20, 2016 at 7:14 AM, Jonathan Gray <jonny.g...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to write ~60 million rows from a DataFrame to a database
>>> using JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
>>>
>>> The write seems to not be performing well.  Profiling the application
>>> with a master of local[*] it appears there is not much socket write
>>> activity and also not much CPU.
>>>
>>> I would expect there to be an almost continuous block of socket write
>>> activity showing up somewhere in the profile.
>>>
>>> I can see that the top hot method involves
>>> apache.spark.unsafe.platform.CopyMemory all from calls within
>>> JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
>>> stressed so I'm guessing this isn't the cause of the problem.
>>>
>>> Are there any best practices, or has anyone come across a case like this
>>> before where a write to a database seems to perform poorly?
>>>
>>> Thanks,
>>> Jon
>>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Spark 1.6.1 DataFrame write to JDBC

2016-04-19 Thread Jonathan Gray
Hi,

I'm trying to write ~60 million rows from a DataFrame to a database using
JDBC using Spark 1.6.1, something similar to df.write().jdbc(...)
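
For context, the write is essentially the following (the table name and
credentials below are placeholders, not the real ones):

    import java.util.Properties
    import org.apache.spark.sql.DataFrame

    def writeRows(df: DataFrame, url: String): Unit = {
      val props = new Properties()
      props.setProperty("user", "app_user")          // placeholder
      props.setProperty("password", "app_password")  // placeholder
      // ~60 million rows go out through this single call.
      df.write.jdbc(url, "target_table", props)
    }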

The write seems to not be performing well.  Profiling the application with
a master of local[*] it appears there is not much socket write activity and
also not much CPU.

I would expect there to be an almost continuous block of socket write
activity showing up somewhere in the profile.

I can see that the top hot method involves
apache.spark.unsafe.platform.CopyMemory all from calls within
JdbcUtils.savePartition(...).  However, the CPU doesn't seem particularly
stressed so I'm guessing this isn't the cause of the problem.

Are there any best practices, or has anyone come across a case like this
before where a write to a database seems to perform poorly?

Thanks,
Jon