Thank you all for your valuable inputs. Sorry for getting back to you late; I
was dealing with some personal issues.

Mich, to answer your earlier question: yes, it is a fact table. Thank you.

Ayan, I tried ROWNUM as the split column with 100 partitions, but the job was
taking forever to complete. Thank you.
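
One middle-ground idea I am considering is to expose a synthetic, evenly
distributed column through a subquery and let Spark range-partition on it.
A rough, untested sketch (the JDBC URL, schema and table names below are
placeholders for my actual ones):

val wrapped =
  "(SELECT t.*, MOD(DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID), 100) AS part_id " +
  "FROM MY_SCHEMA.MY_FACT_TABLE t) src"

val df = spark.read
  .format("jdbc")
  .option("url", oracleJdbcUrl)            // assumed to hold the URL + credentials
  .option("dbtable", wrapped)
  .option("partitionColumn", "part_id")    // synthetic column, values 0..99
  .option("lowerBound", "0")
  .option("upperBound", "100")
  .option("numPartitions", "100")
  .load()

I still need to verify that pushing the synthetic column into each partition's
WHERE clause keeps the cost on the Oracle side reasonable, but at least the
partitions should come out roughly even.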

Igor, I will try your solution and let you know. Thank you.

Rabin, I agree it can be done through Sqoop, but I wanted to use Spark SQL
and see how fast it would be compared to Sqoop. Thank you.

Suresh, if I have 100 partitions, do I have to provide 100 predicates (one per
partition) to load the data into each partition respectively? Thank you.
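
If yes, I assume the predicate array can be generated in a loop rather than
written out by hand. A rough, untested sketch (ORA_HASH is an Oracle built-in;
the JDBC URL and table name are placeholders):

import java.util.Properties

val numPartitions = 100
val predicates = (0 until numPartitions)
  .map(i => s"MOD(ORA_HASH(ROWID), $numPartitions) = $i")   // one bucket per partition
  .toArray

val df = spark.read.jdbc(oracleJdbcUrl, "MY_SCHEMA.MY_FACT_TABLE", predicates,
  new Properties())

My understanding is that each predicate becomes one task, so every task would
still scan the table and keep only its own bucket; please correct me if that
is wrong.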






On Tue, Sep 13, 2016 at 2:05 PM, Suresh Thalamati <
suresh.thalam...@gmail.com> wrote:

> There is also another jdbc method in the DataFrame reader API to specify
> your own predicates for each partition. Using this you can control what is
> included in each partition.
>
> val jdbcPartitionWhereClause = Array[String]("id < 100", "id >= 100 and id < 200")
> val df = spark.read.jdbc(
>   urlWithUserAndPass,
>   "TEST.PEOPLE",
>   jdbcPartitionWhereClause,
>   new Properties())
>
>
>
> Hope that helps.
> -suresh
>
>
> On Sep 13, 2016, at 9:44 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com>
> wrote:
>
> Trust me, the only thing that can help you in your situation is the Sqoop
> Oracle direct connector, known as OraOop. Spark cannot do everything; you
> need an Oozie workflow that triggers a Sqoop job with the Oracle direct
> connector to pull the data, and then a Spark batch job to process it.
>
> Hope it helps !!
>
> On Tue, Sep 13, 2016 at 6:10 PM, Igor Racic <igor.ra...@gmail.com> wrote:
>
>> Hi,
>>
>> One way is to use the NTILE function to partition the data.
>> Example:
>>
>> REM Creating test table
>> create table Test_part as select * from ( select rownum rn from
>> all_tables t1 ) where rn <= 1000;
>>
>> REM Partition lines by Oracle block number, 11 partitions in this
>> example.
>> select ntile(11) over( order by dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) )
>> nt from Test_part
>>
>>
>> Let's see distribution:
>>
>> select nt, count(*) from ( select ntile(11) over( order by
>> dbms_rowid.ROWID_BLOCK_NUMBER( rowid ) ) nt from Test_part) group by nt;
>>
>>         NT   COUNT(*)
>> ---------- ----------
>>          1         10
>>          6         10
>>         11          9
>>          2         10
>>          4         10
>>          5         10
>>          8         10
>>          3         10
>>          7         10
>>          9          9
>>         10          9
>>
>> 11 rows selected.
>> ^^ It looks good. Feel free to choose any other condition to order your
>> lines, whatever best suits your case.
>>
>> So you can either
>> 1) have one session reading and then decide where each line goes (one reader), or
>> 2) do multiple reads by specifying the partition number. Note that in this
>> case you read the whole table n times (in parallel), which is more intensive
>> on the read side (multiple readers).
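>>
>> For option 2, the partition number can be pushed into Spark's JDBC
>> predicates, roughly like this (untested sketch; the url value and connection
>> properties are placeholders):
>>
>> val parts = 11
>> val sub = s"(select t.*, ntile($parts) over (order by " +
>>   "dbms_rowid.ROWID_BLOCK_NUMBER(rowid)) nt from Test_part t) x"
>> val preds = (1 to parts).map(i => s"nt = $i").toArray
>> val df = spark.read.jdbc(url, sub, preds, new java.util.Properties())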
>>
>> Regards,
>> Igor
>>
>>
>>
>> 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh <mich.talebza...@gmail.com>:
>>
>>> Good points
>>>
>>> Unfortunately Data Pump and the older exp/imp use binary formats for export
>>> and import, which cannot be used to load data into HDFS in a suitable way.
>>>
>>> One can use a flat-file extract script to get the data out as tab- or
>>> comma-separated text, etc.
>>>
>>> ROWNUM is a pseudocolumn (not a real column) that is available in a
>>> query. The issue is that in a table of 280 million rows, getting the
>>> position of a row requires a full table scan, since no index can be built
>>> on ROWNUM (assuming there is no other suitable index). Not ideal, but it
>>> can be done.
>>>
>>> I think a better alternative is to use datapump to take that table to
>>> DEV/TEST, add a sequence (like an IDENTITY column in Sybase), build a
>>> unique index on the sequence column and do the partitioning there.
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> In Oracle, something called ROWNUM is present in every row. You can
>>>> create an even distribution using that column. If it is one-time work,
>>>> try using Sqoop. Are you using Oracle's own appliance? Then you can use
>>>> the Data Pump format.
>>>> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> Creating an Oracle sequence for a table of 200 million rows is not going
>>>>> to be that easy without changing the schema. It is possible to export that
>>>>> table from prod, import it into DEV/TEST and create the sequence there.
>>>>>
>>>>> If it is a FACT table, then the foreign keys from the dimension tables
>>>>> will have bitmap indexes on the FACT table, so they could potentially be
>>>>> used.
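>>>>>
>>>>> For example, a date-style dimension key could be used directly as the
>>>>> partition column (rough, untested sketch; the column, bounds and table
>>>>> names below are only placeholders):
>>>>>
>>>>> val df = spark.read
>>>>>   .format("jdbc")
>>>>>   .option("url", oracleJdbcUrl)              // placeholder connection string
>>>>>   .option("dbtable", "MY_SCHEMA.MY_FACT_TABLE")
>>>>>   .option("partitionColumn", "DATE_KEY")     // hypothetical FK to a date dimension
>>>>>   .option("lowerBound", "20100101")
>>>>>   .option("upperBound", "20161231")
>>>>>   .option("numPartitions", "100")
>>>>>   .load()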
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Yeah, Spark does not have the same functionality as Sqoop here.
>>>>>> I think one simple solution is to assign unique ids to the Oracle
>>>>>> table yourself.
>>>>>> Thoughts?
>>>>>>
>>>>>> // maropu
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Strange that an Oracle table of 200 million plus rows has not been
>>>>>>> partitioned.
>>>>>>>
>>>>>>> What matters here is to have parallel connections from JDBC to
>>>>>>> Oracle, each reading a subset of the table. Any parallel fetch is going
>>>>>>> to be better than reading with one connection from Oracle.
>>>>>>>
>>>>>>> Surely among 404 columns there must be one with high enough cardinality
>>>>>>> to make this work.
>>>>>>>
>>>>>>> Maybe you should just create table <small> as select * from
>>>>>>> Oracle_table where rownum <= 1000000; and use that for testing.
>>>>>>>
>>>>>>> Another alternative is to use the Oracle SQL Connector for HDFS
>>>>>>> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>,
>>>>>>> which can do it for you. With 404 columns it is difficult to suggest
>>>>>>> anything else. Is this a FACT table?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Everyone,
>>>>>>>>
>>>>>>>> My goal is to use Spark SQL to load a huge amount of data from Oracle
>>>>>>>> to HDFS.
>>>>>>>>
>>>>>>>> *Table in Oracle:*
>>>>>>>> 1) no primary key.
>>>>>>>> 2) Has 404 columns.
>>>>>>>> 3) Has 200,800,000 rows.
>>>>>>>>
>>>>>>>> *Spark SQL:*
>>>>>>>> In my Spark SQL job I want to read the data into n partitions in
>>>>>>>> parallel, for which I need to provide 'partitionColumn', 'lowerBound',
>>>>>>>> 'upperBound' and 'numPartitions' for the Oracle table. My table in
>>>>>>>> Oracle has no column that satisfies this need (the candidates are
>>>>>>>> highly skewed). Because of that, if numPartitions is set to 104, 102
>>>>>>>> tasks finish in a minute, one task finishes in 20 minutes and the last
>>>>>>>> one takes forever.
>>>>>>>>
>>>>>>>> Is there anything I could do to distribute the data evenly into
>>>>>>>> partitions? Can we use a fake query to orchestrate this pull process,
>>>>>>>> as we do in Sqoop with something like '--boundary-query "SELECT CAST(0
>>>>>>>> AS NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM
>>>>>>>> DUAL"'?
>>>>>>>>
>>>>>>>> Any pointers are appreciated.
>>>>>>>>
>>>>>>>> Thanks for your time.
>>>>>>>>
>>>>>>>> ~ Ajay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>
