There is also another jdbc method in the DataFrameReader API that lets you specify your own predicates, one per partition. Using this you can control exactly what is included in each partition.
val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
val df = spark.read.jdbc(
  urlWithUserAndPass,
  "TEST.PEOPLE",
  jdbcPartitionWhereClause,  // one WHERE clause per partition
  new Properties())

Hope that helps.

-suresh

> On Sep 13, 2016, at 9:44 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
> Trust me, the only thing that can help you in your situation is the Sqoop Oracle direct connector, known as ORAOOP. Spark cannot do everything; you need an Oozie workflow that triggers a Sqoop job with the Oracle direct connector to pull the data, then a Spark batch to process it.
>
> Hope it helps!!
>
> On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.ra...@gmail.com> wrote:
> Hi,
>
> One way can be to use the NTILE function to partition the data.
> Example:
>
> REM Create a test table
> create table Test_part as
>   select * from (select rownum rn from all_tables t1) where rn <= 1000;
>
> REM Partition rows by Oracle block number, 11 partitions in this example
> select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
> from Test_part;
>
> Let's see the distribution:
>
> select nt, count(*)
> from (select ntile(11) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt
>       from Test_part)
> group by nt;
>
>         NT   COUNT(*)
> ---------- ----------
>          1         10
>          6         10
>         11          9
>          2         10
>          4         10
>          5         10
>          8         10
>          3         10
>          7         10
>          9          9
>         10          9
>
> 11 rows selected.
>
> ^^ It looks good. Feel free to choose any other ORDER BY expression that best suits your case.
>
> So you can either:
> 1) have one session read the table and decide where each row goes (one reader), or
> 2) do multiple reads, each selecting one partition number. Note that in this case you read the whole table n times (in parallel), which is more intensive on the read side (multiple readers).
>
> Regards,
> Igor
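Igor's second option maps naturally onto the predicates API shown at the top of this thread. Below is a rough, untested sketch; the table name, connection URL and bucket count are placeholders. Note that each partition re-runs the NTILE subquery, so Oracle reads the whole table once per bucket, which is exactly the extra read cost Igor mentions.

import java.util.Properties

// One JDBC partition per NTILE bucket (placeholder names throughout).
// The subquery adds a synthetic bucket column "nt"; each predicate
// then selects a single bucket.
val numBuckets = 11
val bucketed =
  s"""(select t.*,
     |        ntile($numBuckets) over (order by dbms_rowid.ROWID_BLOCK_NUMBER(t.rowid)) nt
     |   from TEST_PART t) b""".stripMargin
val bucketPredicates = (1 to numBuckets).map(n => s"nt = $n").toArray
val df = spark.read.jdbc(urlWithUserAndPass, bucketed, bucketPredicates, new Properties())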
> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
> Good points.
>
> Unfortunately Data Pump and the exp/imp utilities use a binary format for export and import, which cannot be used to import data into HDFS in a suitable way.
>
> One can use what is known as a flat-file shell script to get the data out as tab- or comma-separated values, etc.
>
> ROWNUM is a pseudocolumn (not a real column) that is available in a query. The issue is that in a table of 280 million rows, getting the position of a row requires a table scan, since no index can be built on a pseudocolumn (assuming there is no other suitable index). Not ideal, but it can be done.
>
> I think a better alternative is to use Data Pump to take that table to DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a unique index on the sequence column and do the partitioning there.
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
> In Oracle, something called ROWNUM is present in every row. You can create an even distribution using that column. If it is one-time work, try using Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump format.
>
> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:
> Creating an Oracle sequence for a table of 200 million rows is not going to be easy without changing the schema. It is possible to export that table from prod, import it into DEV/TEST and create the sequence there.
>
> If it is a FACT table, then the foreign keys from the dimension tables will be bitmap indexes on the FACT table, so they can potentially be used.
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com> wrote:
> Hi,
>
> Yeah, Spark does not have the same functionality as Sqoop here. I think one simple solution is to assign unique ids to the Oracle table yourself. Thoughts?
>
> // maropu
>
> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> Strange that an Oracle table of 200 million plus rows has not been partitioned.
>
> What matters here is to have parallel connections from JDBC to Oracle, each reading a subset of the table. Any parallel fetch is going to be better than reading with one connection from Oracle.
>
> Surely among 404 columns there must be one with high enough cardinality to make this work.
>
> Maybe you should just create table <small> as select * from Oracle_table where rownum <= 1000000; and use that for testing.
>
> Another alternative is the Oracle SQL Connector for HDFS <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>, which can do this for you. With 404 columns it is difficult to suggest anything else. Is this a FACT table?
>
> HTH
>
> Dr Mich Talebzadeh
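For reference, when such a high-cardinality numeric column does exist, the parallel range read Mich describes is the built-in path. A minimal sketch, assuming a placeholder column SOME_ID with reasonably uniform values; Spark derives one WHERE clause per partition from the bounds and opens parallel connections, one per partition:

val df = spark.read
  .format("jdbc")
  .option("url", urlWithUserAndPass)   // placeholder connection URL
  .option("dbtable", "BIG_TABLE")      // placeholder table name
  .option("partitionColumn", "SOME_ID")
  .option("lowerBound", "1")
  .option("upperBound", "200800000")
  .option("numPartitions", "104")
  .load()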
> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com> wrote:
> Hello Everyone,
>
> My goal is to use Spark SQL to load a huge amount of data from Oracle into HDFS.
>
> Table in Oracle:
> 1) No primary key.
> 2) Has 404 columns.
> 3) Has 200,800,000 rows.
>
> Spark SQL:
> In my Spark SQL I want to read the data into n partitions in parallel, for which I need to provide 'partitionColumn', 'lowerBound', 'upperBound' and 'numPartitions' for the Oracle table. My table in Oracle has no column that satisfies this need (the data is highly skewed); because of that, if numPartitions is set to 104, 102 tasks finish within a minute, one task finishes in 20 minutes, and the last one takes forever.
>
> Is there anything I could do to distribute the data evenly into partitions? Can we set up a fake query to orchestrate this pull, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
>
> Any pointers are appreciated.
>
> Thanks for your time.
>
> ~ Ajay
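Closing the loop on the original question: one way to approximate Sqoop's --boundary-query modulus trick on the Spark side, given that the table has no usable key column, is to hash ROWID into a fixed number of buckets and issue one predicate per bucket. A hedged, untested sketch with placeholder names; ora_hash(rowid, 103) yields buckets 0..103, so each of the 104 partitions gets a near-uniform share regardless of column skew, at the cost of Oracle evaluating the hash for every row in each partition's query:

import java.util.Properties

// One predicate per hash bucket; table name and URL are placeholders.
val hashPredicates = (0 to 103).map(b => s"ora_hash(rowid, 103) = $b").toArray
val df = spark.read.jdbc(urlWithUserAndPass, "BIG_TABLE", hashPredicates, new Properties())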