Please be aware that accumulators involve communication back to the driver
and may not be efficient. I think the OP wants a way to extract the stats
from the SQL plan, if they are stored in some internal data structure.
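
For the plan-stats route, something along these lines may be worth trying.
This is a rough sketch only: whether write operations notify a
QueryExecutionListener depends on the Spark version, and metric names such
as "numOutputRows" vary by plan node, so both are assumptions to verify.

    import org.apache.spark.sql.execution.QueryExecution;
    import org.apache.spark.sql.execution.metric.SQLMetric;
    import org.apache.spark.sql.util.QueryExecutionListener;
    import scala.collection.JavaConverters;

    // Register before triggering the write; onSuccess hands us the query
    // execution, whose executed plan carries the metrics shown in the SQL tab.
    spark.listenerManager().register(new QueryExecutionListener() {
        @Override
        public void onSuccess(String funcName, QueryExecution qe, long durationNs) {
            java.util.Map<String, SQLMetric> metrics = JavaConverters
                .mapAsJavaMapConverter(qe.executedPlan().metrics()).asJava();
            // Print everything; pick out "numOutputRows" (assumed name) if present.
            metrics.forEach((name, metric) ->
                System.out.println(name + " = " + metric.value()));
        }

        @Override
        public void onFailure(String funcName, QueryExecution qe, Exception error) {
            // no-op for this sketch
        }
    });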

Regards
Sab

On 5 Nov 2016 9:42 p.m., "Deepak Sharma" <deepakmc...@gmail.com> wrote:

> Hi Rohit
> You can use accumulators and increment one on every record processed.
> At the end you can get the value of the accumulator on the driver, which
> will give you the count.
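>
> For illustration, a minimal sketch of that idea (assuming the Spark 2.x
> Java API; names like rowCount are just placeholders). Note the increment
> happens inside a transformation, so task retries or speculative execution
> can overcount:
>
>     import org.apache.spark.api.java.function.MapFunction;
>     import org.apache.spark.sql.Dataset;
>     import org.apache.spark.sql.Row;
>     import org.apache.spark.sql.catalyst.encoders.RowEncoder;
>     import org.apache.spark.util.LongAccumulator;
>
>     LongAccumulator rowCount = spark.sparkContext().longAccumulator("rowCount");
>     // Identity map that bumps the accumulator as each row streams through.
>     Dataset<Row> counted = dataset.map(
>         (MapFunction<Row, Row>) row -> { rowCount.add(1L); return row; },
>         RowEncoder.apply(dataset.schema()));
>     counted.write().format("parquet").save("pdfs-path");
>     long count = rowCount.value();  // read on the driver after the write action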
>
> HTH
> Deepak
>
> On Nov 5, 2016 20:09, "Rohit Verma" <rohit.ve...@rokittech.com> wrote:
>
>> I am using Spark to read from a database and write to HDFS as a Parquet
>> file. Here is the code snippet.
>>
>> private long etlFunction(SparkSession spark) {
>>     spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
>>     Properties properties = new Properties();
>>     properties.put("driver", "oracle.jdbc.driver.OracleDriver");
>>     properties.put("fetchSize", "5000");
>>     Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
>>     dataset.write().format("parquet").save("pdfs-path");
>>     return dataset.count();
>> }
>>
>> When I look at the Spark UI during the write, the number of records
>> written is visible in the SQL tab under the query plan.
>>
>> The count itself, however, is a heavy task, since it triggers a second
>> read from the database.
>>
>> Can someone suggest the most optimized way to get the count?
>>
>> Thanks all.
>>
>
