Hi, which of the following is the better approach when the database table has a very large number of rows?


      final Dataset<Row> dataset = spark.sqlContext().read()
                .format("jdbc")
                .option("url", params.getJdbcUrl())
                .option("driver", params.getDriver())
                .option("dbtable", params.getSqlQuery())
//                .option("partitionColumn", hashFunction)
//                .option("lowerBound", 0)
//                .option("upperBound", 10)
//                .option("numPartitions", 10)
//                .option("oracle.jdbc.timezoneAsRegion", "false")
                .option("fetchSize", 100000)
                .load();
        dataset.write().parquet(params.getPath());
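
On the "very large table" part: if a single JDBC connection becomes the bottleneck, the commented-out options above can be enabled to parallelise the read. The sketch below reuses the same spark/params variables from the snippet above; the hashFunction placeholder is taken from the commented lines, and the 0..10 bounds with 10 partitions are illustrative assumptions, not values from a working job:

        // Minimal sketch, assuming hashFunction resolves to a numeric column in
        // the result set whose values fall in [0, 10); bounds and partition
        // count here are illustrative only.
        final Dataset<Row> partitionedDataset = spark.sqlContext().read()
                .format("jdbc")
                .option("url", params.getJdbcUrl())
                .option("driver", params.getDriver())
                .option("dbtable", params.getSqlQuery())
                .option("partitionColumn", hashFunction)
                .option("lowerBound", 0)
                .option("upperBound", 10)
                .option("numPartitions", 10)   // ten parallel JDBC scans
                .option("fetchSize", 100000)
                .load();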

// Goal: get the count of the rows that were persisted to parquet.


        // Approach 1: get the count directly from the dataset.
        // As I understand it, this count is translated into a count over the
        // JDBC RDD, so it may be executed against the database again.
        long countFromJdbc = dataset.count();

        // Approach 2: read the saved parquet back and count that.
        long countFromParquet = spark.read().parquet(params.getPath()).count();
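
For comparison only (not one of the two approaches above, just a sketch): persisting the dataset before the write would let count() be served from Spark's cache instead of triggering a second JDBC scan or a parquet re-read, assuming the data fits in executor memory/disk:

        // needs: import org.apache.spark.storage.StorageLevel;
        dataset.persist(StorageLevel.MEMORY_AND_DISK());
        dataset.write().parquet(params.getPath()); // first action, populates the cache
        long persistedCount = dataset.count();     // answered from the cache, not the database
        dataset.unpersist();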


Regards
Rohit
