There is also another jdbc method in the DataFrameReader API that lets you specify your own predicates, one per partition. Using this you can control exactly what is included in each partition.
val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
  urlWithUserAndPass,
  "TEST.PEOPLE",
  jdbcPartitionWhereClause,  // one WHERE clause per partition
  new Properties())

Hope that helps.

-suresh

> On Sep 13, 2016, at 9:44 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
> Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, known as ORAOOP. Spark cannot do everything; you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it.
>
> Hope it helps!!
>
> On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.ra...@gmail.com> wrote:
> Hi,
>
> One way can be to use the NTILE function to partition the data.
> Example:
>
> REM Create a test table
> create table Test_part as
>   select * from (select rownum rn from all_tables t1) where rn <= 1000;
>
> REM Partition rows by Oracle block number, 11 partitions in this example
> select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
> from Test_part;
>
> Let's see the distribution:
>
> select nt, count(*)
> from (select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
>       from Test_part)
> group by nt;
>
>         NT   COUNT(*)
> ---------- ----------
>          1         10
>          6         10
>         11          9
>          2         10
>          4         10
>          5         10
>          8         10
>          3         10
>          7         10
>          9          9
>         10          9
>
> 11 rows selected.
>
> ^^ It looks good. Feel free to choose any other ORDER BY expression that best suits your case.
>
> So you can either:
> 1) have one session read the table and decide where each row goes (one reader), or
> 2) do multiple reads, each selecting one partition number. Note that in this case you read the whole table n times (in parallel), which is more intensive on the read side (multiple readers).
>
> Regards,
> Igor
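Igor's second option maps naturally onto the predicates API shown at the top of this thread. Below is a rough, untested sketch; the table name, connection URL and bucket count are placeholders. Note that each partition re-runs the NTILE subquery, so Oracle reads the whole table once per bucket, which is exactly the extra read cost Igor mentions.

import java.util.Properties

// One JDBC partition per NTILE bucket (placeholder names throughout).
// The subquery adds a synthetic bucket column "nt"; each predicate
// then selects a single bucket.
val numBuckets = 11
val bucketed =
  s"""(select t.*,
     |        ntile($numBuckets) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(t.rowid)) nt
     |   from TEST_PART t) b""".stripMargin
val bucketPredicates = (1 to numBuckets).map(n => s"nt = $n").toArray
val df = spark.read.jdbc(urlWithUserAndPass, bucketed, bucketPredicates, new Properties())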
> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> Good points.
>
> Unfortunately Data Pump and the exp/imp utilities use a binary format for export and import, which cannot be used to import data into HDFS in a suitable way.
>
> One can use what is known as a flat-file shell script to get the data out as tab- or comma-separated values, etc.
>
> ROWNUM is a pseudocolumn (not a real column) that is available in a query. The issue is that in a table of 280 million rows, getting the position of a row requires a table scan, since no index can be built on a pseudocolumn (assuming there is no other suitable index). Not ideal, but it can be done.
>
> I think a better alternative is to use Data Pump to take that table to DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a unique index on the sequence column and do the partitioning there.
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
> In Oracle, something called ROWNUM is present in every row. You can create an even distribution using that column. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format.
>
> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
> Creating an Oracle sequence for a table of 200 million rows is not going to be easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST and create the sequence there.
>
> If it is a FACT table, then the foreign keys from the dimension tables will be bitmap indexes on the FACT table, so they can potentially be used.
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> Yeah, Spark does not have the same functionality as Sqoop here. I think one simple solution is to assign unique ids to the Oracle table yourself. Thoughts?
>
> // maropu
>
> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Strange that an Oracle table of 200 million plus rows has not been partitioned.
>
> What matters here is to have parallel connections from JDBC to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle.
>
> Surely among 404 columns there must be one with high enough cardinality to make this work.
>
> Maybe you should just create table <small> as select * from Oracle_table where rownum <= 1000000; and use that for testing.
>
> Another alternative is the Oracle SQL Connector for HDFS <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>, which can do this for you. With 404 columns it is difficult to suggest anything else. Is this a FACT table?
>
> HTH
>
> Dr Mich Talebzadeh
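For reference, when such a high-cardinality numeric column does exist, the parallel range read Mich describes is the built-in path. A minimal sketch, assuming a placeholder column SOME_ID with reasonably uniform values; Spark derives one WHERE clause per partition from the bounds and opens parallel connections, one per partition:

val df = spark.read
  .format("jdbc")
  .option("url", urlWithUserAndPass)   // placeholder connection URL
  .option("dbtable", "BIG_TABLE")      // placeholder table name
  .option("partitionColumn", "SOME_ID")
  .option("lowerBound", "1")
  .option("upperBound", "200800000")
  .option("numPartitions", "104")
  .load()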
> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com> wrote:
> Hello Everyone,
>
> My goal is to use Spark SQL to load a huge amount of data from Oracle into HDFS.
>
> Table in Oracle:
> 1) No primary key.
> 2) Has 404 columns.
> 3) Has 200,800,000 rows.
>
> Spark SQL:
> In my Spark SQL I want to read the data into n partitions in parallel, for which I need to provide 'partitionColumn', 'lowerBound', 'upperBound' and 'numPartitions' for the Oracle table. My table in Oracle has no column that satisfies this need (the data is highly skewed); because of that, if numPartitions is set to 104, 102 tasks finish within a minute, one task finishes in 20 minutes, and the last one takes forever.
>
> Is there anything I could do to distribute the data evenly into partitions? Can we set up a fake query to orchestrate this pull, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
>
> Any pointers are appreciated.
>
> Thanks for your time.
>
> ~ Ajay
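Closing the loop on the original question: one way to approximate Sqoop's --boundary-query modulus trick on the Spark side, given that the table has no usable key column, is to hash ROWID into a fixed number of buckets and issue one predicate per bucket. A hedged, untested sketch with placeholder names; ora_hash(rowid, 103) yields buckets 0..103, so each of the 104 partitions gets a near-uniform share regardless of column skew, at the cost of Oracle evaluating the hash for every row in each partition's query:

import java.util.Properties

// One predicate per hash bucket; table name and URL are placeholders.
val hashPredicates = (0 to 103).map(b => s"ora_hash(rowid, 103) = $b").toArray
val df = spark.read.jdbc(urlWithUserAndPass, "BIG_TABLE", hashPredicates, new Properties())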