Hi,

Yeah, Spark does not have the same functionality as Sqoop. I think one simple solution is to assign unique ids to the Oracle table yourself. Thoughts?
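For example, something like this might work. This is only a sketch: the table name `BIG_TABLE`, the JDBC URL, and the credentials are all hypothetical, and it assumes Oracle's ORA_HASH over ROWID gives a reasonably uniform spread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-parallel-read").getOrCreate()

val numPartitions = 104

// Wrap the table in a subquery that derives a synthetic, roughly uniform
// partition id from ROWID. ORA_HASH(expr, maxBucket) returns a value in
// [0, maxBucket], so PART_ID here is in [0, numPartitions - 1].
val query =
  s"""(SELECT t.*, ORA_HASH(ROWID, ${numPartitions - 1}) AS PART_ID
     |   FROM BIG_TABLE t) tmp""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL") // hypothetical host/SID
  .option("dbtable", query)
  .option("user", "scott")        // hypothetical credentials
  .option("password", "tiger")
  .option("partitionColumn", "PART_ID")
  .option("lowerBound", "0")
  .option("upperBound", s"$numPartitions")
  .option("numPartitions", s"$numPartitions")
  .load()

df.write.parquet("hdfs:///data/big_table") // land it on HDFS
```

Since PART_ID is hash-derived, each of the 104 partitions should pull a similar number of rows, avoiding the skew you'd get from partitioning on a skewed natural column.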
// maropu

On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Strange that an Oracle table of 200 million plus rows has not been partitioned.
>
> What matters here is to have parallel connections from JDBC to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle.
>
> Surely among 404 columns there must be one with high cardinality to satisfy this work.
>
> Maybe you should just create table <small> as select * from Oracle_table where rownum <= 1000000; and use that for testing.
>
> The other alternative is to use the Oracle SQL Connector for HDFS <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>, which can do it for you. With 404 columns it is difficult to suggest any alternative. Is this a FACT table?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> My goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS.
>>
>> *Table in Oracle:*
>> 1) No primary key.
>> 2) Has 404 columns.
>> 3) Has 200,800,000 rows.
>>
>> *Spark SQL:*
>> In my Spark SQL I want to read the data into n partitions in parallel, for which I need to provide 'partitionColumn', 'lowerBound', 'upperBound', and 'numPartitions' for the Oracle table.
>> My table in Oracle has no such column to satisfy this need (highly skewed); because of this, if numPartitions is set to 104, then 102 tasks finish in a minute, one task finishes in 20 minutes, and the last one takes forever.
>>
>> Is there anything I could do to distribute the data evenly into partitions? Can we set a fake query to orchestrate this pull process, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
>>
>> Any pointers are appreciated.
>>
>> Thanks for your time.
>>
>> ~ Ajay

--
---
Takeshi Yamamuro
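For what it's worth, the MOD-based split that the Sqoop `--boundary-query` achieves can be mimicked in Spark with the `predicates` variant of the JDBC reader, which takes one WHERE clause per partition. Again a sketch only, with hypothetical table name and connection details:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-mod-read").getOrCreate()

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL" // hypothetical host/SID
val props = new Properties()
props.setProperty("user", "scott")     // hypothetical credentials
props.setProperty("password", "tiger")

// One predicate per partition: each Oracle session scans only the rows
// whose hashed ROWID falls in its bucket, like Sqoop's MOD split over
// [MIN_MOD_VAL, MAX_MOD_VAL).
val buckets = 12
val predicates = (0 until buckets)
  .map(b => s"MOD(ORA_HASH(ROWID), $buckets) = $b")
  .toArray

val df = spark.read.jdbc(url, "BIG_TABLE", predicates, props)
df.write.parquet("hdfs:///data/big_table")
```

Note that each predicate still triggers a scan on the Oracle side, so the reads run in parallel but the total server-side work is not reduced; increase `buckets` to match the parallelism you want.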