In Oracle, every query result exposes a ROWNUM pseudocolumn. You can create
an even distribution using that column. If it is one-time work, try using
Sqoop. Are you using Oracle's own appliance? Then you can use the Data Pump
format.
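A hedged sketch of the bucketing idea above: Spark's `DataFrameReader.jdbc` has an overload that accepts one WHERE-clause predicate per partition, so even distribution can come from a modulo bucket rather than from real column bounds. Note that `MOD` on ROWNUM itself behaves unintuitively in a WHERE clause (ROWNUM is assigned as rows pass the filter), so this sketch uses a stable per-row expression, `MOD(ORA_HASH(ROWID), n)`, instead; the table name and connection details below are illustrative assumptions, not from the thread.

```python
def mod_predicates(num_partitions):
    """Build one non-overlapping WHERE clause per partition.

    MOD(ORA_HASH(ROWID), n) = k assigns every row a stable bucket,
    so the k-th JDBC connection fetches roughly 1/n of the table.
    """
    return [
        f"MOD(ORA_HASH(ROWID), {num_partitions}) = {k}"
        for k in range(num_partitions)
    ]

predicates = mod_predicates(4)
# Each string would go into the predicates overload, e.g. (PySpark):
# spark.read.jdbc(url, "SCHEMA.BIG_TABLE", predicates, connection_props)
```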
On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com> wrote:

> Creating an Oracle sequence for a table of 200 million rows is not going to
> be that easy without changing the schema. It is possible to export that table
> from prod, import it into DEV/TEST and create the sequence there.
>
> If it is a FACT table, then the foreign keys from the dimension tables will
> be bitmap indexes on the FACT table, so they can potentially be used.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Hi,
>>
>> Yeah, Spark does not have the same functionality as Sqoop.
>> I think one simple solution is to assign unique IDs to the Oracle
>> table yourself.
>> Thoughts?
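One way to assign IDs without touching the schema is to wrap the table in a subquery that derives a synthetic partition column, and point Spark's JDBC options at that column. A minimal sketch, assuming a derived `PART_COL` via `MOD(ORA_HASH(ROWID), n)` — the column name, expression and URL are illustrative, not from the thread:

```python
def jdbc_options(table, url, num_partitions):
    """Build Spark JDBC read options around a synthetic partition column.

    The subquery gives every row a stable bucket 0..n-1, so the bounds
    (0, n) slice the table evenly regardless of data skew.
    """
    dbtable = (
        f"(SELECT t.*, MOD(ORA_HASH(ROWID), {num_partitions}) AS PART_COL "
        f"FROM {table} t)"
    )
    return {
        "url": url,
        "dbtable": dbtable,
        "partitionColumn": "PART_COL",
        "lowerBound": "0",
        "upperBound": str(num_partitions),
        "numPartitions": str(num_partitions),
    }

# Usage sketch (PySpark):
# df = spark.read.format("jdbc").options(**jdbc_options(
#     "SCHEMA.BIG_TABLE", "jdbc:oracle:thin:@host:1521/svc", 12)).load()
```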
>>
>> // maropu
>>
>>
>> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Strange that an Oracle table of 200 million plus rows has not been
>>> partitioned.
>>>
>>> What matters here is to have parallel connections from JDBC to Oracle,
>>> each reading a subset of the table. Any parallel fetch is going to be
>>> better than reading with a single connection from Oracle.
>>>
>>> Surely among 404 columns there must be one with high enough cardinality
>>> to satisfy this need.
>>>
>>> Maybe you should just create table <small> as select * from
>>> Oracle_table where rownum <= 1000000; and use that for testing.
>>>
>>> The other alternative is to use the Oracle SQL Connector for HDFS
>>> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>,
>>> which can do it for you. With 404 columns it is difficult to suggest any
>>> alternative. Is this a FACT table?
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com> wrote:
>>>
>>>> Hello Everyone,
>>>>
>>>> My goal is to use Spark SQL to load a huge amount of data from Oracle
>>>> into HDFS.
>>>>
>>>> *Table in Oracle:*
>>>> 1) no primary key.
>>>> 2) Has 404 columns.
>>>> 3) Has 200,800,000 rows.
>>>>
>>>> *Spark SQL:*
>>>> In my Spark SQL job I want to read the data into n partitions in
>>>> parallel, for which I need to provide 'partitionColumn', 'lowerBound',
>>>> 'upperBound' and 'numPartitions' for the Oracle table. My table in Oracle
>>>> has no such column to satisfy this need (it is highly skewed); because of
>>>> that, if numPartitions is set to 104, 102 tasks finish in a minute, one
>>>> task finishes in 20 minutes, and the last one takes forever.
>>>>
>>>> Is there anything I can do to distribute the data evenly across
>>>> partitions? Can we set a fake boundary query to orchestrate this pull
>>>> process, as we do in Sqoop with '--boundary-query "SELECT CAST(0 AS
>>>> NUMBER) AS MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"'?
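For context on why the skew happens: Spark slices the range [lowerBound, upperBound) into equal strides of the partition column and issues one WHERE clause per slice, so a skewed column yields skewed tasks. A simplified, illustrative reconstruction of that slicing (not Spark's exact code):

```python
def column_partition(column, lower, upper, num_partitions):
    """Sketch of Spark's stride-based JDBC partitioning.

    Splits [lower, upper) into num_partitions equal strides and emits one
    WHERE clause per partition; NULLs fall into the first partition.
    """
    stride = (upper - lower) // num_partitions  # equal strides, hence skew
    clauses = []
    current = lower + stride
    for i in range(num_partitions):
        if i == 0:
            clauses.append(f"{column} < {current} OR {column} IS NULL")
        elif i == num_partitions - 1:
            clauses.append(f"{column} >= {current - stride}")
        else:
            clauses.append(f"{column} >= {current - stride} AND {column} < {current}")
        current += stride
    return clauses
```

With bounds 0 and 12 on a MOD-style bucket column (the analogue of the Sqoop boundary query quoted above), each slice covers the same number of buckets, so the partitions come out even.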
>>>>
>>>> Any pointers are appreciated.
>>>>
>>>> Thanks for your time.
>>>>
>>>> ~ Ajay
>>>>
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>
>
