Hi Ninad,

I believe the purpose of JdbcRDD is to use an RDBMS as an additional data source during data processing; the main goal of Spark is still analyzing data from HDFS-like file systems.
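For what it's worth, a minimal sketch of a partitioned JDBC read via the DataFrame API (which supersedes the low-level JdbcRDD); the URL, table, column, and bounds below are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

object JdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

    // Hypothetical connection details -- replace with your own.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "big_table")
      .option("user", "etl")
      // Partitioning options split the scan into parallel tasks;
      // without them the whole table is pulled through a single connection,
      // which will not scale to billions of rows.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000000")
      .option("numPartitions", "200")
      .load()

    df.write.parquet("hdfs:///data/big_table")
  }
}
```

Note that each partition still issues its own range query against the database, so the source has to tolerate 200 concurrent scans here.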
Using Spark as a data integration tool to transfer billions of records from an RDBMS to HDFS could work, but it may not be the best tool. Sqoop with --direct sounds better, though it has configuration costs; Sqoop is best suited for regular data integration tasks. I am not sure whether your client needs to transfer billions of records periodically. If it is only an initial load, then for such a one-off task a bash script with the COPY command may be easier and faster :)

Best,
Teng

2016-10-18 4:24 GMT+02:00 Ninad Shringarpure <ni...@cloudera.com>:
>
> Hi Team,
>
> One of my client teams is trying to see if they can use Spark to source
> data from RDBMS instead of Sqoop. Data would be substantially large in the
> order of billions of records.
>
> I am not sure reading the documentations whether jdbcRDD by design is
> going to be able to scale well for this amount of data. Plus some in-built
> features provided in Sqoop like --direct might give better performance than
> straight up jdbc.
>
> My primary question to this group is if it is advisable to use jdbcRDD for
> data sourcing and can we expect it to scale. Also performance wise how
> would it compare to Sqoop.
>
> Please let me know your thoughts and any pointers if anyone in the group
> has already implemented it.
>
> Thanks,
> Ninad
>
>
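P.S. A sketch of the one-off COPY approach, assuming a PostgreSQL source (host, table, and HDFS path are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stream the table out of PostgreSQL as CSV and pipe it straight
# into HDFS without landing on local disk. "-" tells hdfs dfs -put
# to read from stdin.
psql "host=dbhost dbname=mydb user=etl" \
  -c "\copy big_table TO STDOUT WITH (FORMAT csv)" \
  | hdfs dfs -put - /data/big_table/part-0000.csv
```

This is single-threaded, but for an initial load the database's sequential scan plus a straight byte pipe is often fast enough, and there is nothing to configure.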