Re: Sqoop vs spark jdbc

2016-09-21 Thread Don Drake
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
>> mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
>> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
>> mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
>> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
>> mapred.job.classpath.files is deprecated. Instead, use
>> mapreduce.job.classpath.files
>> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
>> user.name is deprecated. Instead, use mapreduce.job.user.name
>> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
>> mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
>> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
>> mapred.cache.files.filesizes is deprecated. Instead, use
>> mapreduce.job.cache.files.filesizes
>> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
>> mapred.output.key.class is deprecated. Instead, use
>> mapreduce.job.output.key.class
>> 2016-09-21 21:00:23,656 [myid:] - INFO  [main:JobSubmitter@477] -
>> Submitting tokens for job: job_1474455325627_0045
>> 2016-09-21 21:00:23,955 [myid:] - INFO  [main:YarnClientImpl@174] -
>> Submitted application application_1474455325627_0045 to ResourceManager at
>> rhes564/50.140.197.217:8032
>> 2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to
>> track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
>> 2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job:
>> job_1474455325627_0045
>> 2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job
>> job_1474455325627_0045 running in uber mode : false
>> 2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce
>> 0%
>> 2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25%
>> reduce 0%
>> 2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50%
>> reduce 0%
>> 2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75%
>> reduce 0%
>> 2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100%
>> reduce 0%
>>
>> 2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job
>> job_1474455325627_0045 completed successfully
>> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant
>> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com>
>> wrote:
>>
>>> Uhmmm…
>>>
>>> A bit of a longer-ish answer…
>>>
>>> Spark may or may not be faster than sqoop. The standard caveats apply…
>>> YMMV.
>>>
>>> The reason I say this… you have a couple of limiting factors.  The main
>>> one being the number of connections allowed with the target RDBMS.
>>>
>>> Then there’s the data distribution within the partitions / ranges in the
>>> database.
>>> By this, I mean that using any parallel solution, you need to run copies
>>> of your query in parallel over different ranges within the database. Most
>>> of the time you may run the query over a database where there is even
>>> distribution… if not, then you will have one thread run longer than the
>>> others.  Note that this is a problem that both solutions would face.
>>>
>>> Then there’s the cluster itself.
>>> Again YMMV on your spark job vs a Map/Reduce job.
>>>
>>> In terms of launching the job, setup, etc … the spark job could take
>>> longer to setup.  But on long running queries, that becomes noise.
>>>
>>> The issue is what makes the most sense to you, where do you have the
>>> most experience, and wh

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
> 2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to
> track the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
> 2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job:
> job_1474455325627_0045
> 2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job
> job_1474455325627_0045 running in uber mode : false
> 2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce
> 0%
> 2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25% reduce
> 0%
> 2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50% reduce
> 0%
> 2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75% reduce
> 0%
> 2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100%
> reduce 0%
>
> 2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job
> job_1474455325627_0045 completed successfully
> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant
> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com>
> wrote:
>
>> Uhmmm…
>>
>> A bit of a longer-ish answer…
>>
>> Spark may or may not be faster than sqoop. The standard caveats apply…
>> YMMV.
>>
>> The reason I say this… you have a couple of limiting factors.  The main
>> one being the number of connections allowed with the target RDBMS.
>>
>> Then there’s the data distribution within the partitions / ranges in the
>> database.
>> By this, I mean that using any parallel solution, you need to run copies
>> of your query in parallel over different ranges within the database. Most
>> of the time you may run the query over a database where there is even
>> distribution… if not, then you will have one thread run longer than the
>> others.  Note that this is a problem that both solutions would face.
>>
>> Then there’s the cluster itself.
>> Again YMMV on your spark job vs a Map/Reduce job.
>>
>> In terms of launching the job, setup, etc … the spark job could take
>> longer to setup.  But on long running queries, that becomes noise.
>>
>> The issue is what makes the most sense to you, where do you have the most
>> experience, and what do you feel the most comfortable in using.
>>
>> The other issue is what do you do with the data (RDDs,DataSets, Frames,
>> etc) once you have read the data?
>>
>>
>> HTH
>>
>> -Mike
>>
>> PS. I know that I’m responding to an earlier message in the thread, but
>> this is something that I’ve heard lots of questions about… and its not a
>> simple thing to answer… Since this is a batch process.  The performance
>> issues are moot.
>>
>> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Personally I prefer Spark JDBC.
>>
>> Both Sqoop and Spark rely on the same drivers.
>>
>> I think Spark is faster and if you have many nodes you can partition your
>> incoming data and take advantage of Spark DAG + in memory offering.
>>
>> By default Sqoop will use Map-reduce which is pretty slow.
>>
>> Remember for Spark you will need to have sufficient memory
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 August 2016 at 22:39, Venkata Penikalapati <
>> mail.venkatakart...@gmail.com> wrote:
>>
>>> Team,
>>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>>> ?
>>>
>>> I'm performing few analytics using spark data for which data is residing
>>> in rdbms.
>>>
>>> Please guide me with this.
>>>
>>>
>>> Thanks
>>> Venkata Karthik P
>>>
>>>
>>
>>
>


Re: Sqoop vs spark jdbc

2016-09-21 Thread Jörn Franke
> mapred.output.dir is deprecated. Instead, use 
> mapreduce.output.fileoutputformat.outputdir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.job.classpath.files is deprecated. Instead, use 
> mapreduce.job.classpath.files
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - user.name 
> is deprecated. Instead, use mapreduce.job.user.name
> 2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] - 
> mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] - 
> mapred.cache.files.filesizes is deprecated. Instead, use 
> mapreduce.job.cache.files.filesizes
> 2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] - 
> mapred.output.key.class is deprecated. Instead, use 
> mapreduce.job.output.key.class
> 2016-09-21 21:00:23,656 [myid:] - INFO  [main:JobSubmitter@477] - Submitting 
> tokens for job: job_1474455325627_0045
> 2016-09-21 21:00:23,955 [myid:] - INFO  [main:YarnClientImpl@174] - Submitted 
> application application_1474455325627_0045 to ResourceManager at 
> rhes564/50.140.197.217:8032
> 2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to track 
> the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
> 2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job: 
> job_1474455325627_0045
> 2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job 
> job_1474455325627_0045 running in uber mode : false
> 2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce 0%
> 2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25% reduce 0%
> 2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50% reduce 0%
> 2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75% reduce 0%
> 2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100% reduce 0%
> 2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job 
> job_1474455325627_0045 completed successfully
> 2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported 
> Failed: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com> 
>> wrote:
>> Uhmmm… 
>> 
>> A bit of a longer-ish answer…
>> 
>> Spark may or may not be faster than sqoop. The standard caveats apply… YMMV. 
>> 
>> The reason I say this… you have a couple of limiting factors.  The main one 
>> being the number of connections allowed with the target RDBMS. 
>> 
>> Then there’s the data distribution within the partitions / ranges in the 
>> database.  
>> By this, I mean that using any parallel solution, you need to run copies of 
>> your query in parallel over different ranges within the database. Most of 
>> the time you may run the query over a database where there is even 
>> distribution… if not, then you will have one thread run longer than the 
>> others.  Note that this is a problem that both solutions would face. 
>> 
>> Then there’s the cluster itself. 
>> Again YMMV on your spark job vs a Map/Reduce job. 
>> 
>> In terms of launching the job, setup, etc … the spark job could take longer 
>> to setup.  But on long running queries, that becomes noise. 
>> 
>> The issue is what makes the most sense to you, where do you have the most 
>> experience, and what do you feel the most comfortable in using. 
>> 
>> The other issue is what do you do with the data (RDDs,DataSets, Frames, etc) 
>> once you have read the data? 
>> 
>> 
>> HTH
>> 
>> -Mike
>> 
>> PS. I know that I’m responding to an earlier message in the thread, but this 
>> is something that I’ve heard lots of questions about… and its not a simple 
>> thing to answer… Since this is a batch process.  The per

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
2016-09-21 21:00:23,548 [myid:] - INFO  [main:Configuration@840] -
mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
mapred.cache.files.filesizes is deprecated. Instead, use
mapreduce.job.cache.files.filesizes
2016-09-21 21:00:23,549 [myid:] - INFO  [main:Configuration@840] -
mapred.output.key.class is deprecated. Instead, use
mapreduce.job.output.key.class
2016-09-21 21:00:23,656 [myid:] - INFO  [main:JobSubmitter@477] -
Submitting tokens for job: job_1474455325627_0045
2016-09-21 21:00:23,955 [myid:] - INFO  [main:YarnClientImpl@174] -
Submitted application application_1474455325627_0045 to ResourceManager at
rhes564/50.140.197.217:8032
2016-09-21 21:00:23,980 [myid:] - INFO  [main:Job@1272] - The url to track
the job: http://http://rhes564:8088/proxy/application_1474455325627_0045/
2016-09-21 21:00:23,981 [myid:] - INFO  [main:Job@1317] - Running job:
job_1474455325627_0045
2016-09-21 21:00:31,180 [myid:] - INFO  [main:Job@1338] - Job
job_1474455325627_0045 running in uber mode : false
2016-09-21 21:00:31,182 [myid:] - INFO  [main:Job@1345] -  map 0% reduce 0%
2016-09-21 21:00:40,260 [myid:] - INFO  [main:Job@1345] -  map 25% reduce 0%
2016-09-21 21:00:44,283 [myid:] - INFO  [main:Job@1345] -  map 50% reduce 0%
2016-09-21 21:00:48,308 [myid:] - INFO  [main:Job@1345] -  map 75% reduce 0%
2016-09-21 21:00:55,346 [myid:] - INFO  [main:Job@1345] -  map 100% reduce
0%

2016-09-21 21:00:56,359 [myid:] - INFO  [main:Job@1356] - Job
job_1474455325627_0045 completed successfully
2016-09-21 21:00:56,501 [myid:] - ERROR [main:ImportTool@607] - Imported Failed: No enum constant
org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 21 September 2016 at 20:56, Michael Segel <michael_se...@hotmail.com>
wrote:

> Uhmmm…
>
> A bit of a longer-ish answer…
>
> Spark may or may not be faster than sqoop. The standard caveats apply…
> YMMV.
>
> The reason I say this… you have a couple of limiting factors.  The main
> one being the number of connections allowed with the target RDBMS.
>
> Then there’s the data distribution within the partitions / ranges in the
> database.
> By this, I mean that using any parallel solution, you need to run copies
> of your query in parallel over different ranges within the database. Most
> of the time you may run the query over a database where there is even
> distribution… if not, then you will have one thread run longer than the
> others.  Note that this is a problem that both solutions would face.
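To make the uneven-range point concrete, here is a minimal sketch of steering the split boundaries by hand with explicit predicates. A Spark 2.x session named spark is assumed, and the URL, credentials, table, column and boundary values are hypothetical; each predicate becomes the WHERE clause of one parallel query, so a known hot range can be given its own, smaller slice:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")          // placeholder credentials
props.setProperty("password", "dbpassword")

// One predicate per partition; Spark runs one query per entry in parallel.
val predicates = Array(
  "order_id <  1000000",
  "order_id >= 1000000 AND order_id < 2000000",   // a known skewed range gets its own slice
  "order_id >= 2000000 AND order_id < 5000000",
  "order_id >= 5000000"
)

val df = spark.read.jdbc(
  "jdbc:oracle:thin:@dbhost:1521:mydb",   // placeholder URL
  "scratchpad.orders",                    // hypothetical table
  predicates,
  props)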
>
> Then there’s the cluster itself.
> Again YMMV on your spark job vs a Map/Reduce job.
>
> In terms of launching the job, setup, etc … the spark job could take
> longer to set up.  But on long-running queries, that becomes noise.
>
> The issue is what makes the most sense to you, where you have the most
> experience, and what you feel most comfortable using.
>
> The other issue is what you do with the data (RDDs, Datasets, DataFrames,
> etc.) once you have read it.
>
>
> HTH
>
> -Mike
>
> PS. I know that I’m responding to an earlier message in the thread, but
> this is something that I’ve heard lots of questions about… and it's not a
> simple thing to answer… Since this is a batch process, the performance
> issues are moot.
>
> On Aug 24, 2016, at 5:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster and if you have many nodes you can partition your
> incoming data and take advantage of Spark DAG + in memory offering.
>
> By default Sqoop will use Map-reduce which is pretty slow.
>
> Remember for Spark you will need to have sufficient memory
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will 

Re: Sqoop vs spark jdbc

2016-09-21 Thread Mich Talebzadeh
>>> It does work, opens parallel connections to Oracle DB and creates DF
>>> with the specified number of partitions.
>>>
>>> One thing I am not sure or tried if Spark supports direct mode yet.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote:
>>>
>>>> Which RDBMS are you using here, and what is the data volume and
>>>> frequency of pulling data off the RDBMS?
>>>> Specifying these would help in giving better answers.
>>>>
>>>> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and
>>>> Oracle, so you can use that for better performance if using one of these
>>>> databases.
>>>>
>>>> And don't forget that you Sqoop can load data directly into Parquet or
>>>> Avro (I think direct mode is not supported in this case).
>>>> Also you can use Kite SDK with Sqoop to manage/transform datasets,
>>>> perform schema evolution and such.
>>>>
>>>> ~bhaskar
>>>>
>>>>
>>>> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
>>>> mail.venkatakart...@gmail.com> wrote:
>>>>
>>>>> Team,
>>>>> Please help me in choosing sqoop or spark jdbc to fetch data from
>>>>> rdbms. Sqoop has lot of optimizations to fetch data does spark jdbc also
>>>>> has those ?
>>>>>
>>>>> I'm performing few analytics using spark data for which data is
>>>>> residing in rdbms.
>>>>>
>>>>> Please guide me with this.
>>>>>
>>>>>
>>>>> Thanks
>>>>> Venkata Karthik P
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Sqoop on Spark

2016-09-14 Thread Mich Talebzadeh
Sqoop is a standalone product (a utility) that is used to get data out of
JDBC-compliant database tables into HDFS, and into Hive if specified.
Spark can also use JDBC to get data out of such tables. However, I have
not come across a situation where Sqoop is invoked from Spark.

Have a look at Sqoop doc
<https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html>
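For comparison, a minimal Spark 2.x sketch of the same flow (read a table over JDBC, then land it in Hive) might look like the following; the URL, credentials and table names are placeholders, not a tested recipe:

import org.apache.spark.sql.SparkSession

// Hive support is needed so that saveAsTable goes to the Hive metastore.
val spark = SparkSession.builder()
  .appName("jdbc-to-hive")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@dbhost:1521:mydb")   // placeholder URL
  .option("dbtable", "scratchpad.dummy")                 // placeholder source table
  .option("user", "dbuser")
  .option("password", "dbpassword")
  .load()

// Roughly what "sqoop import --hive-import" gives you on the Sqoop side.
df.write.mode("overwrite").saveAsTable("test.dummy")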


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 15:31, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com
> wrote:

> Hi Experts,
>
> Good morning.
>
> I am looking for some references on how to use sqoop with spark. could you
> please let me know if there are any references on how to use it.
>
> Thanks,
> Asmath.
>


Sqoop on Spark

2016-09-14 Thread KhajaAsmath Mohammed
Hi Experts,

Good morning.

I am looking for some references on how to use sqoop with spark. Could you
please let me know if there are any references on how to use it.

Thanks,
Asmath.


Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Hi,

I am using Hadoop 2.6

hduser@rhes564: /home/hduser/dba/bin> hadoop version
Hadoop 2.6.0

Thanks






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 11:48, Bhaskar Dutta <bhas...@gmail.com> wrote:

> This constant was added in Hadoop 2.3. Maybe you are using an older
> version?
>
> ~bhaskar
>
> On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Actually I started using Spark to import data from RDBMS (in this case
>> Oracle) after upgrading to Hive 2, running an import like below
>>
>> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12"
>> --username scratchpad -P \
>> --query "select * from scratchpad.dummy2 where \
>>  \$CONDITIONS" \
>>   --split-by ID \
>>--hive-import  --hive-table "test.dumy2" --target-dir
>> "/tmp/dummy2" *--direct*
>>
>> This gets the data into HDFS and then throws this error
>>
>> ERROR [main] tool.ImportTool: Imported Failed: No enum constant
>> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>>
>> I can easily get the data into Hive from the file on HDFS or dig into the
>> problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
>> free like below
>>
>>  val df = HiveContext.read.format("jdbc").options(
>>  Map("url" -> dbURL,
>>  "dbtable" -> "scratchpad.dummy)",
>>  "partitionColumn" -> partitionColumnName,
>>  "lowerBound" -> lowerBoundValue,
>>  "upperBound" -> upperBoundValue,
>>  "numPartitions" -> numPartitionsValue,
>>  "user" -> dbUserName,
>>  "password" -> dbPassword)).load
>>
>> It does work, opens parallel connections to Oracle DB and creates DF with
>> the specified number of partitions.
>>
>> One thing I am not sure or tried if Spark supports direct mode yet.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote:
>>
>>> Which RDBMS are you using here, and what is the data volume and
>>> frequency of pulling data off the RDBMS?
>>> Specifying these would help in giving better answers.
>>>
>>> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and
>>> Oracle, so you can use that for better performance if using one of these
>>> databases.
>>>
>>> And don't forget that you Sqoop can load data directly into Parquet or
>>> Avro (I think direct mode is not supported in this case).
>>> Also you can use Kite SDK with Sqoop to manage/transform datasets,
>>> perform schema evolution and such.
>>>
>>> ~bhaskar
>>>
>>>
>>> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
>>> mail.venkatakart...@gmail.com> wrote:
>>>
>>>> Team,
>>>> Please help me in choosing sqoop or spark jdbc to fetch data from
>>>> rdbms. Sqoop has lot of optimizations to fetch data does spark jdbc also
>>>> has those ?
>>>>
>>>> I'm performing few analytics using spark data for which data is
>>>> residing in rdbms.
>>>>
>>>> Please guide me with this.
>>>>
>>>>
>>>> Thanks
>>>> Venkata Karthik P
>>>>
>>>>
>>>
>>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
This constant was added in Hadoop 2.3. Maybe you are using an older version?

~bhaskar

On Thu, Aug 25, 2016 at 3:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Actually I started using Spark to import data from RDBMS (in this case
> Oracle) after upgrading to Hive 2, running an import like below
>
> sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username
> scratchpad -P \
> --query "select * from scratchpad.dummy2 where \
>  \$CONDITIONS" \
>   --split-by ID \
>--hive-import  --hive-table "test.dumy2" --target-dir
> "/tmp/dummy2" *--direct*
>
> This gets the data into HDFS and then throws this error
>
> ERROR [main] tool.ImportTool: Imported Failed: No enum constant
> org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS
>
> I can easily get the data into Hive from the file on HDFS or dig into the
> problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
> free like below
>
>  val df = HiveContext.read.format("jdbc").options(
>  Map("url" -> dbURL,
>  "dbtable" -> "scratchpad.dummy)",
>  "partitionColumn" -> partitionColumnName,
>  "lowerBound" -> lowerBoundValue,
>  "upperBound" -> upperBoundValue,
>  "numPartitions" -> numPartitionsValue,
>  "user" -> dbUserName,
>  "password" -> dbPassword)).load
>
> It does work, opens parallel connections to Oracle DB and creates DF with
> the specified number of partitions.
>
> One thing I am not sure or tried if Spark supports direct mode yet.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote:
>
>> Which RDBMS are you using here, and what is the data volume and frequency
>> of pulling data off the RDBMS?
>> Specifying these would help in giving better answers.
>>
>> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and
>> Oracle, so you can use that for better performance if using one of these
>> databases.
>>
>> And don't forget that you Sqoop can load data directly into Parquet or
>> Avro (I think direct mode is not supported in this case).
>> Also you can use Kite SDK with Sqoop to manage/transform datasets,
>> perform schema evolution and such.
>>
>> ~bhaskar
>>
>>
>> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
>> mail.venkatakart...@gmail.com> wrote:
>>
>>> Team,
>>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>>> ?
>>>
>>> I'm performing few analytics using spark data for which data is residing
>>> in rdbms.
>>>
>>> Please guide me with this.
>>>
>>>
>>> Thanks
>>> Venkata Karthik P
>>>
>>>
>>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Mich Talebzadeh
Actually I started using Spark to import data from RDBMS (in this case
Oracle) after upgrading to Hive 2, running an import like below

sqoop import --connect "jdbc:oracle:thin:@rhes564:1521:mydb12" --username scratchpad -P \
  --query "select * from scratchpad.dummy2 where \$CONDITIONS" \
  --split-by ID \
  --hive-import --hive-table "test.dumy2" --target-dir "/tmp/dummy2" --direct

This gets the data into HDFS and then throws this error

ERROR [main] tool.ImportTool: Imported Failed: No enum constant
org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_MAPS

I can easily get the data into Hive from the file on HDFS or dig into the
problem (Spark 2, Hive 2, Hadoop 2.6, Sqoop 1.4.5) but I find Spark trouble
free like below

val df = HiveContext.read.format("jdbc").options(
  Map("url" -> dbURL,
      "dbtable" -> "scratchpad.dummy",
      // the four options below drive the parallel, partitioned read;
      // all values in this Map must be strings
      "partitionColumn" -> partitionColumnName,
      "lowerBound" -> lowerBoundValue,
      "upperBound" -> upperBoundValue,
      "numPartitions" -> numPartitionsValue,
      "user" -> dbUserName,
      "password" -> dbPassword)).load

It does work, opens parallel connections to Oracle DB and creates DF with
the specified number of partitions.
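As a usage note, the bound values are often derived from the source table itself before the partitioned read. A sketch of that (assuming dbURL, dbUserName, dbPassword and the HiveContext val from the snippet above, and a numeric ID split column) could be:

import java.util.Properties

val props = new Properties()
props.setProperty("user", dbUserName)
props.setProperty("password", dbPassword)

// Ask the database for MIN/MAX of the split column via an inline view.
val bounds = HiveContext.read.jdbc(
  dbURL, "(select min(ID) lo, max(ID) hi from scratchpad.dummy) b", props).collect()(0)

// Oracle NUMBER comes back as a decimal; access by position to avoid name-case issues.
val lo = bounds.getDecimal(0).longValue
val hi = bounds.getDecimal(1).longValue

// Partitioned read: Spark issues numPartitions parallel queries over ID ranges.
val df2 = HiveContext.read.jdbc(dbURL, "scratchpad.dummy", "ID", lo, hi, 8, props)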

One thing I am not sure or tried if Spark supports direct mode yet.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 25 August 2016 at 09:07, Bhaskar Dutta <bhas...@gmail.com> wrote:

> Which RDBMS are you using here, and what is the data volume and frequency
> of pulling data off the RDBMS?
> Specifying these would help in giving better answers.
>
> Sqoop has a direct mode (non-JDBC) support for Postgres, MySQL and Oracle,
> so you can use that for better performance if using one of these databases.
>
> And don't forget that you Sqoop can load data directly into Parquet or
> Avro (I think direct mode is not supported in this case).
> Also you can use Kite SDK with Sqoop to manage/transform datasets, perform
> schema evolution and such.
>
> ~bhaskar
>
>
> On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
> mail.venkatakart...@gmail.com> wrote:
>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Bhaskar Dutta
Which RDBMS are you using here, and what is the data volume and frequency
of pulling data off the RDBMS?
Specifying these would help in giving better answers.

Sqoop has direct-mode (non-JDBC) support for Postgres, MySQL and Oracle,
so you can use that for better performance if you are using one of these databases.

And don't forget that Sqoop can load data directly into Parquet or Avro
(I think direct mode is not supported in this case).
Also, you can use the Kite SDK with Sqoop to manage/transform datasets, perform
schema evolution and such.

~bhaskar
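For illustration, a direct-mode import against Postgres might look like the sketch below; the host, database, table and target paths are placeholders, and per the caveat above the Parquet variant goes through the regular (non-direct) path:

# Direct mode: uses the database's native bulk-export tooling instead of plain JDBC fetches.
sqoop import \
  --connect jdbc:postgresql://dbhost/mydb \
  --username dbuser -P \
  --table orders \
  --direct \
  --target-dir /data/orders \
  --num-mappers 4

# Parquet variant (no --direct, per the caveat above).
sqoop import \
  --connect jdbc:postgresql://dbhost/mydb \
  --username dbuser -P \
  --table orders \
  --as-parquetfile \
  --target-dir /data/orders_parquet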

On Thu, Aug 25, 2016 at 3:09 AM, Venkata Penikalapati <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
> ?
>
> I'm performing few analytics using spark data for which data is residing
> in rdbms.
>
> Please guide me with this.
>
>
> Thanks
> Venkata Karthik P
>
>


Re: Sqoop vs spark jdbc

2016-08-25 Thread Sean Owen
Sqoop is probably the more mature tool for the job. It also just does
one thing. The argument for doing it in Spark would be wanting to
integrate it with a larger workflow. I imagine Sqoop would be more
efficient and flexible for just the task of ingest, including
continuously pulling deltas, which I am not sure Spark really does for
you.

MapReduce won't matter here. The bottleneck is reading from the RDBMS
in general.

On Wed, Aug 24, 2016 at 11:07 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
> Personally I prefer Spark JDBC.
>
> Both Sqoop and Spark rely on the same drivers.
>
> I think Spark is faster and if you have many nodes you can partition your
> incoming data and take advantage of Spark DAG + in memory offering.
>
> By default Sqoop will use Map-reduce which is pretty slow.
>
> Remember for Spark you will need to have sufficient memory
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 24 August 2016 at 22:39, Venkata Penikalapati
> <mail.venkatakart...@gmail.com> wrote:
>>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Sqoop vs spark jdbc

2016-08-24 Thread ayan guha
Hi

Adding one more lens to it: if we are talking about a one-off migration use
case, or a weekly sync, sqoop would be a better choice. If we are talking
about regularly feeding data from the DB to Hadoop, and doing some transformation
in the pipe, spark will do better.

On Thu, Aug 25, 2016 at 2:08 PM, Ranadip Chatterjee <ranadi...@gmail.com>
wrote:

> This will depend on multiple factors. Assuming we are talking significant
> volumes of data, I'd prefer sqoop compared to spark on yarn, if ingestion
> performance is the sole consideration (which is true in many production use
> cases). Sqoop provides some potential optimisations specially around using
> native database batch extraction tools that spark cannot take advantage of.
> The performance inefficiency of using MR (actually map-only) is
> insignificant over a large corpus of data. Further, in a shared cluster, if
> the data volume is skewed for the given partition key, spark, without
> dynamic container allocation, can be significantly inefficient from cluster
> resources usage perspective. With dynamic allocation enabled, it is less so
> but sqoop still has a slight edge due to the time Spark holds on to the
> resources before giving them up.
>
> If ingestion is part of a more complex DAG that relies on Spark cache (rdd
> / dataframe or dataset), then using Spark jdbc can have a significant
> advantage in being able to cache the data without persisting into hdfs
> first. But whether this will convert into an overall significantly better
> performance of the DAG or cluster will depend on the DAG stages and their
> performance. In general, if the ingestion stage is the significant
> bottleneck in the DAG, then the advantage will be significant.
>
> Hope this provides a general direction to consider in your case.
>
> On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <
> mail.venkatakart...@gmail.com> wrote:
>
>> Team,
>> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
>> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
>> ?
>>
>> I'm performing few analytics using spark data for which data is residing
>> in rdbms.
>>
>> Please guide me with this.
>>
>>
>> Thanks
>> Venkata Karthik P
>>
>>


-- 
Best Regards,
Ayan Guha


Re: Sqoop vs spark jdbc

2016-08-24 Thread Ranadip Chatterjee
This will depend on multiple factors. Assuming we are talking significant
volumes of data, I'd prefer sqoop compared to spark on yarn, if ingestion
performance is the sole consideration (which is true in many production use
cases). Sqoop provides some potential optimisations, especially around using
native database batch extraction tools that spark cannot take advantage of.
The performance inefficiency of using MR (actually map-only) is
insignificant over a large corpus of data. Further, in a shared cluster, if
the data volume is skewed for the given partition key, spark, without
dynamic container allocation, can be significantly inefficient from cluster
resources usage perspective. With dynamic allocation enabled, it is less so
but sqoop still has a slight edge due to the time Spark holds on to the
resources before giving them up.

If ingestion is part of a more complex DAG that relies on Spark cache (rdd
/ dataframe or dataset), then using Spark jdbc can have a significant
advantage in being able to cache the data without persisting into hdfs
first. But whether this will convert into an overall significantly better
performance of the DAG or cluster will depend on the DAG stages and their
performance. In general, if the ingestion stage is the significant
bottleneck in the DAG, then the advantage will be significant.
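A minimal sketch of that pattern (a Spark session named spark is assumed; the URL, credentials, bounds, column names and output paths are placeholders) would read once over JDBC, cache, and feed several downstream stages without an intermediate HDFS copy:

import java.util.Properties
import org.apache.spark.sql.functions.sum

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpassword")

// Partitioned JDBC read; the bounds are assumed to be known for the split column.
val orders = spark.read.jdbc(
  "jdbc:postgresql://dbhost/mydb", "orders", "order_id", 0L, 10000000L, 16, props)

orders.cache()   // keep the ingested rows in memory for the stages below

val daily  = orders.groupBy("order_date").count()
val byCust = orders.groupBy("customer_id").agg(sum("amount"))

daily.write.mode("overwrite").parquet("/tmp/daily_counts")
byCust.write.mode("overwrite").parquet("/tmp/customer_totals")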

Hope this provides a general direction to consider in your case.

On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
> ?
>
> I'm performing few analytics using spark data for which data is residing
> in rdbms.
>
> Please guide me with this.
>
>
> Thanks
> Venkata Karthik P
>
>


Re: Sqoop vs spark jdbc

2016-08-24 Thread Mich Talebzadeh
Personally I prefer Spark JDBC.

Both Sqoop and Spark rely on the same JDBC drivers.

I think Spark is faster, and if you have many nodes you can partition your
incoming data and take advantage of Spark's DAG + in-memory offering.

By default Sqoop will use MapReduce, which is pretty slow.

Remember that for Spark you will need sufficient memory.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 22:39, Venkata Penikalapati <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me in choosing sqoop or spark jdbc to fetch data from rdbms.
> Sqoop has lot of optimizations to fetch data does spark jdbc also has those
> ?
>
> I'm performing few analytics using spark data for which data is residing
> in rdbms.
>
> Please guide me with this.
>
>
> Thanks
> Venkata Karthik P
>
>


Sqoop vs spark jdbc

2016-08-24 Thread Venkata Penikalapati
Team,
Please help me in choosing Sqoop or Spark JDBC to fetch data from an RDBMS.
Sqoop has a lot of optimizations to fetch data; does Spark JDBC also have those?

I'm performing a few analytics using Spark for which the data is residing in
an RDBMS.

Please guide me with this.

Thanks
Venkata Karthik P


Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-11 Thread cdecleene
The data is not corrupted: I can create the dataframe from the underlying raw
parquet in spark 2.0.0 if, instead of using SparkSession.sql() to create a
dataframe, I use SparkSession.read.parquet().
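A workaround along those lines (shown here as a Scala sketch; the warehouse path and view name are placeholders) is to read the files directly and re-register them for SQL:

// Bypass the Hive table definition and read the underlying Parquet files directly.
val raw = spark.read.parquet("/path/to/warehouse/dra_agency_analytics.db/raw_ewt_agcy_dim")
raw.createOrReplaceTempView("raw_ewt_agcy_dim_files")
spark.sql("SELECT * FROM raw_ewt_agcy_dim_files").show(5)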





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502p27516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-10 Thread cdecleene
Using the scala api instead of the python api yields the same results.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502p27506.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Mich Talebzadeh
Hi,

Is this table created as an external table in Hive?

Do you see the data through spark-sql or the Hive Thrift Server?

There is an issue with Zeppelin seeing data when connecting to the Spark Thrift
Server: rows display null values.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 9 August 2016 at 22:32, cdecleene <cd...@allstate.com> wrote:

> Some details of an example table hive table that spark 2.0 could not
> read...
>
> SerDe Library:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>
> COLUMN_STATS_ACCURATE   false
> kite.compression.type   snappy
> numFiles                0
> numRows                 -1
> rawDataSize             -1
> totalSize               0
>
> All fields within the table are of type "string" and there are less than 20
> of them.
>
> When I say that spark 2.0 cannot read the hive table, I mean that when I
> attempt to execute the following from a pyspark shell...
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")
>
> ... the dataframe df has the correct number of rows and the correct
> columns,
> but all values read as "None".
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-
> created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread Davies Liu
Can you get all the fields back using Scala or SQL (bin/spark-sql)?

On Tue, Aug 9, 2016 at 2:32 PM, cdecleene <cd...@allstate.com> wrote:
> Some details of an example table hive table that spark 2.0 could not read...
>
> SerDe Library:
> org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>
> COLUMN_STATS_ACCURATE   false
> kite.compression.type   snappy
> numFiles                0
> numRows                 -1
> rawDataSize             -1
> totalSize               0
>
> All fields within the table are of type "string" and there are less than 20
> of them.
>
> When I say that spark 2.0 cannot read the hive table, I mean that when I
> attempt to execute the following from a pyspark shell...
>
> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")
>
> ... the dataframe df has the correct number of rows and the correct columns,
> but all values read as "None".
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 1.6.2 can read hive tables created with sqoop, but Spark 2.0.0 cannot

2016-08-09 Thread cdecleene
Some details of an example hive table that spark 2.0 could not read...

SerDe Library:  
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

COLUMN_STATS_ACCURATE   false
kite.compression.type   snappy
numFiles                0
numRows                 -1
rawDataSize             -1
totalSize               0

All fields within the table are of type "string" and there are less than 20
of them. 

When I say that spark 2.0 cannot read the hive table, I mean that when I
attempt to execute the following from a pyspark shell... 

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.sql("SELECT * FROM dra_agency_analytics.raw_ewt_agcy_dim")

... the dataframe df has the correct number of rows and the correct columns,
but all values read as "None". 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-6-2-can-read-hive-tables-created-with-sqoop-but-Spark-2-0-0-cannot-tp27502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Sqoop On Spark

2016-08-01 Thread Takeshi Yamamuro
Hi,

Have you seen this previous thread?
https://www.mail-archive.com/user@spark.apache.org/msg49025.html
I'm not sure this is what you want though.

// maropu


On Tue, Aug 2, 2016 at 1:52 PM, Selvam Raman <sel...@gmail.com> wrote:

>  Hi Team,
>
> How can I use Spark as the execution engine in Sqoop2? I see the patch
> (SQOOP-1532 <https://issues.apache.org/jira/browse/SQOOP-1532>), but it
> shows as in progress.
>
> So can we not use Sqoop on Spark?
>
> Please help me if you have any idea.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



-- 
---
Takeshi Yamamuro


Sqoop On Spark

2016-08-01 Thread Selvam Raman
 Hi Team,

How can I use Spark as the execution engine in Sqoop2? I see the patch
(SQOOP-1532 <https://issues.apache.org/jira/browse/SQOOP-1532>), but it shows
as in progress.

So can we not use Sqoop on Spark?

Please help me if you have any idea.

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Sqoop on Spark

2016-04-14 Thread Mich Talebzadeh
t and
>>> then moving it to HDFS and then doing a bulk load to be more efficient.
>>> (This is less flexible than sqoop, but also stresses the database
>>> servers less. )
>>>
>>> Again, YMMV
>>>
>>>
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Well unless you have plenty of memory, you are going to have certain
>>> issues with Spark.
>>>
>>> I tried to load a billion rows table from oracle through spark using
>>> JDBC and ended up with "Caused by: java.lang.OutOfMemoryError: Java heap
>>> space" error.
>>>
>>> Sqoop uses MapReduce and does it in serial mode, which takes time, and you can
>>> also tell it to create the Hive table. However, it will import data into the Hive
>>> table.
>>>
>>> In any case the mechanism of data import is through JDBC; Spark uses
>>> memory and a DAG, whereas Sqoop relies on MapReduce.
>>>
>>> There is of course another alternative.
>>>
>>> Assuming that your Oracle table has a primary Key say "ID" (it would be
>>> easier if it was a monotonically increasing number) or already partitioned.
>>>
>>>
>>>1. You can create views based on the range of ID or for each
>>>partition. You can then SELECT COLUMNS  co1, col2, coln from view and 
>>> spool
>>>it to a text file on OS (locally say backup directory would be fastest).
>>>2. bzip2 those files and scp them to a local directory in Hadoop
>>>3. You can then use Spark/hive to load the target table from local
>>>files in parallel
>>>4. When creating views take care of NUMBER and CHAR columns in
>>>Oracle and convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln
>>>AS VARCHAR2(n)) AS coln etc
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Some metrics thrown around the discussion:
>>>>
>>>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size
>>>> 21 GB)
>>>> SPARK: load the data into memory (15 mins)
>>>>
>>>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to
>>>> load 500 million records - manually killed after 8 hours.
>>>>
>>>> (both the above studies were done in a system of same capacity, with 32
>>>> GB RAM and dual hexacore Xeon processors and SSD. SPARK was running
>>>> locally, and SQOOP ran on HADOOP2 and extracted data to local file system)
>>>>
>>>> In case any one needs to know what needs to be done to access both the
>>>> CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>
>>>>
>>>> Regards,
>>>> Gourav Sengupta
>>>>
>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com>
>>>> wrote:
>>>>
>>>>> Good to know that.
>>>>>
>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor
>>>>> specific feature.
>>>>>
>>>>> But for MPP, I still think it makes sense that vendor provide some
>>>>> kind of InputFormat, or data source in Spark, so Hadoop eco-system can
>>>>> integrate with them more natively.
>>>>>
>>>>> Yong
>>>>>
>>>>> --
>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>> Subject: Re: Sqoop on Spark
>>>>> From: mohaj...@gmail.com
>>>>> To: java8...@hotmail.com
>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com;
>>>>> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com;
>>>>> user@spark.apache.org
>>>>>
>>>>>
>>>>> It is using JDBC driver, i know that's the case for Teradata:
>>>>>
>>>>> http://developer.teradata.com/conne

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
pping it and 
>>>>>> then moving it to HDFS and then doing a bulk load to be more efficient.
>>>>>> (This is less flexible than sqoop, but also stresses the database 
>>>>>> servers less. ) 
>>>>>> 
>>>>>> Again, YMMV
>>>>>> 
>>>>>> 
>>>>>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>> Well unless you have plenty of memory, you are going to have certain 
>>>>>>> issues with Spark.
>>>>>>> 
>>>>>>> I tried to load a billion rows table from oracle through spark using 
>>>>>>> JDBC and ended up with "Caused by: java.lang.OutOfMemoryError: Java 
>>>>>>> heap space" error.
>>>>>>> 
>>>>>>> Sqoop uses MapR and does it in serial mode which takes time and you can 
>>>>>>> also tell it to create Hive table. However, it will import data into 
>>>>>>> Hive table.
>>>>>>> 
>>>>>>> In any case the mechanism of data import is through JDBC, Spark uses 
>>>>>>> memory and DAG, whereas Sqoop relies on MapR.
>>>>>>> 
>>>>>>> There is of course another alternative.
>>>>>>> 
>>>>>>> Assuming that your Oracle table has a primary Key say "ID" (it would be 
>>>>>>> easier if it was a monotonically increasing number) or already 
>>>>>>> partitioned.
>>>>>>> 
>>>>>>> You can create views based on the range of ID or for each partition. 
>>>>>>> You can then SELECT COLUMNS  co1, col2, coln from view and spool it to 
>>>>>>> a text file on OS (locally say backup directory would be fastest).
>>>>>>> bzip2 those files and scp them to a local directory in Hadoop
>>>>>>> You can then use Spark/hive to load the target table from local files 
>>>>>>> in parallel
>>>>>>> When creating views take care of NUMBER and CHAR columns in Oracle and 
>>>>>>> convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS 
>>>>>>> VARCHAR2(n)) AS coln etc 
>>>>>>> 
>>>>>>> HTH
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  
>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>  
>>>>>>> 
>>>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Some metrics thrown around the discussion:
>>>>>>>> 
>>>>>>>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 
>>>>>>>> 21 GB)
>>>>>>>> SPARK: load the data into memory (15 mins)
>>>>>>>> 
>>>>>>>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to 
>>>>>>>> load 500 million records - manually killed after 8 hours.
>>>>>>>> 
>>>>>>>> (both the above studies were done in a system of same capacity, with 
>>>>>>>> 32 GB RAM and dual hexacore Xeon processors and SSD. SPARK was running 
>>>>>>>> locally, and SQOOP ran on HADOOP2 and extracted data to local file 
>>>>>>>> system)
>>>>>>>> 
>>>>>>>> In case any one needs to know what needs to be done to access both the 
>>>>>>>> CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Gourav Sengupta
>>>>>>>> 
>>>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> 
>>>>>>>>> wrote:
>>>>>>>>> Good to know that.
>>>>>>>&

Re: Sqoop on Spark

2016-04-14 Thread Gourav Sengupta
reasing number) or already partitioned.
>>
>>
>>1. You can create views based on the range of ID or for each
>>partition. You can then SELECT COLUMNS  co1, col2, coln from view and 
>> spool
>>it to a text file on OS (locally say backup directory would be fastest).
>>2. bzip2 those files and scp them to a local directory in Hadoop
>>3. You can then use Spark/hive to load the target table from local
>>files in parallel
>>4. When creating views take care of NUMBER and CHAR columns in Oracle
>>and convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS
>>VARCHAR2(n)) AS coln etc
>>
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Some metrics thrown around the discussion:
>>>
>>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21
>>> GB)
>>> SPARK: load the data into memory (15 mins)
>>>
>>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load
>>> 500 million records - manually killed after 8 hours.
>>>
>>> (both the above studies were done in a system of same capacity, with 32
>>> GB RAM and dual hexacore Xeon processors and SSD. SPARK was running
>>> locally, and SQOOP ran on HADOOP2 and extracted data to local file system)
>>>
>>> In case any one needs to know what needs to be done to access both the
>>> CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com>
>>> wrote:
>>>
>>>> Good to know that.
>>>>
>>>> That is why Sqoop has this "direct" mode, to utilize the vendor
>>>> specific feature.
>>>>
>>>> But for MPP, I still think it makes sense that vendor provide some kind
>>>> of InputFormat, or data source in Spark, so Hadoop eco-system can integrate
>>>> with them more natively.
>>>>
>>>> Yong
>>>>
>>>> --
>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>> Subject: Re: Sqoop on Spark
>>>> From: mohaj...@gmail.com
>>>> To: java8...@hotmail.com
>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com;
>>>> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com;
>>>> user@spark.apache.org
>>>>
>>>>
>>>> It is using JDBC driver, i know that's the case for Teradata:
>>>>
>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>
>>>> Teradata Connector (which is used by Cloudera and Hortonworks) for
>>>> doing Sqoop is parallelized and works with ORC and probably other formats
>>>> as well. It is using JDBC for each connection between data-nodes and their
>>>> AMP (compute) nodes. There is an additional layer that coordinates all of
>>>> it.
>>>> I know Oracle has a similar technology I've used it and had to supply
>>>> the JDBC driver.
>>>>
>>>> Teradata Connector is for batch data copy, QueryGrid is for interactive
>>>> data movement.
>>>>
>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com>
>>>> wrote:
>>>>
>>>> If they do that, they must provide a customized input format, instead
>>>> of through JDBC.
>>>>
>>>> Yong
>>>>
>>>> --
>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>> Subject: Re: Sqoop on Spark
>>>> From: mich.talebza...@gmail.com
>>>> To: mohaj...@gmail.com
>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com;
>>>> guha.a...@gmail.com; linguin@gmail.com; user@spark.apache.org
>>>>
>>>>
>>>> SAP Sybase IQ does that and I believe SAP Hana as well.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>

Re: Sqoop on Spark

2016-04-11 Thread Jörn Franke
> memory and DAG, whereas Sqoop relies on MapR.
>>>>> 
>>>>> There is of course another alternative.
>>>>> 
>>>>> Assuming that your Oracle table has a primary Key say "ID" (it would be 
>>>>> easier if it was a monotonically increasing number) or already 
>>>>> partitioned.
>>>>> 
>>>>> You can create views based on the range of ID or for each partition. You 
>>>>> can then SELECT COLUMNS  co1, col2, coln from view and spool it to a text 
>>>>> file on OS (locally say backup directory would be fastest). 
>>>>> bzip2 those files and scp them to a local directory in Hadoop
>>>>> You can then use Spark/hive to load the target table from local files in 
>>>>> parallel
>>>>> When creating views take care of NUMBER and CHAR columns in Oracle and 
>>>>> convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS 
>>>>> VARCHAR2(n)) AS coln etc 
>>>>> 
>>>>> HTH
>>>>> 
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> 
>>>>>> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Some metrics thrown around the discussion:
>>>>>> 
>>>>>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21 
>>>>>> GB)
>>>>>> SPARK: load the data into memory (15 mins)
>>>>>> 
>>>>>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load 
>>>>>> 500 million records - manually killed after 8 hours.
>>>>>> 
>>>>>> (both the above studies were done in a system of same capacity, with 32 
>>>>>> GB RAM and dual hexacore Xeon processors and SSD. SPARK was running 
>>>>>> locally, and SQOOP ran on HADOOP2 and extracted data to local file 
>>>>>> system)
>>>>>> 
>>>>>> In case any one needs to know what needs to be done to access both the 
>>>>>> CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>>>>> 
>>>>>> 
>>>>>> Regards,
>>>>>> Gourav Sengupta
>>>>>> 
>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> 
>>>>>>> wrote:
>>>>>>> Good to know that.
>>>>>>> 
>>>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor 
>>>>>>> specific feature.
>>>>>>> 
>>>>>>> But for MPP, I still think it makes sense that vendor provide some kind 
>>>>>>> of InputFormat, or data source in Spark, so Hadoop eco-system can 
>>>>>>> integrate with them more natively.
>>>>>>> 
>>>>>>> Yong
>>>>>>> 
>>>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>> From: mohaj...@gmail.com
>>>>>>> To: java8...@hotmail.com
>>>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; 
>>>>>>> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com; 
>>>>>>> user@spark.apache.org
>>>>>>> 
>>>>>>> 
>>>>>>> It is using JDBC driver, i know that's the case for Teradata:
>>>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>>> 
>>>>>>> Teradata Connector (which is used by Cloudera and Hortonworks) for 
>>>>>>> doing Sqoop is parallelized and works with ORC and probably other 
>>>>>>> formats as well. It is using JDBC for each connection between 
>>>>>>> data-nodes and their AMP (compute) nodes. There is an additional layer 
>>>>>>> that coordinates all of it.
>>>>>>> I know Oracle has a similar technology I've used it and had to supply 

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
o a text 
>>> file on OS (locally say backup directory would be fastest).
>>> bzip2 those files and scp them to a local directory in Hadoop
>>> You can then use Spark/hive to load the target table from local files in 
>>> parallel
>>> When creating views take care of NUMBER and CHAR columns in Oracle and 
>>> convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS 
>>> VARCHAR2(n)) AS coln etc 
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com 
>>> <mailto:gourav.sengu...@gmail.com>> wrote:
>>> Hi,
>>> 
>>> Some metrics thrown around the discussion:
>>> 
>>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21 GB)
>>> SPARK: load the data into memory (15 mins)
>>> 
>>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load 
>>> 500 million records - manually killed after 8 hours.
>>> 
>>> (both the above studies were done in a system of same capacity, with 32 GB 
>>> RAM and dual hexacore Xeon processors and SSD. SPARK was running locally, 
>>> and SQOOP ran on HADOOP2 and extracted data to local file system)
>>> 
>>> In case any one needs to know what needs to be done to access both the CSV 
>>> and JDBC modules in SPARK Local Server mode, please let me know.
>>> 
>>> 
>>> Regards,
>>> Gourav Sengupta
>>> 
>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com 
>>> <mailto:java8...@hotmail.com>> wrote:
>>> Good to know that.
>>> 
>>> That is why Sqoop has this "direct" mode, to utilize the vendor specific 
>>> feature.
>>> 
>>> But for MPP, I still think it makes sense that vendor provide some kind of 
>>> InputFormat, or data source in Spark, so Hadoop eco-system can integrate 
>>> with them more natively.
>>> 
>>> Yong
>>> 
>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>> Subject: Re: Sqoop on Spark
>>> From: mohaj...@gmail.com <mailto:mohaj...@gmail.com>
>>> To: java8...@hotmail.com <mailto:java8...@hotmail.com>
>>> CC: mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>; 
>>> jornfra...@gmail.com <mailto:jornfra...@gmail.com>; 
>>> msegel_had...@hotmail.com <mailto:msegel_had...@hotmail.com>; 
>>> guha.a...@gmail.com <mailto:guha.a...@gmail.com>; linguin@gmail.com 
>>> <mailto:linguin@gmail.com>; user@spark.apache.org 
>>> <mailto:user@spark.apache.org>
>>> 
>>> 
>>> It is using JDBC driver, i know that's the case for Teradata:
>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>  
>>> <http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available>
>>> 
>>> Teradata Connector (which is used by Cloudera and Hortonworks) for doing 
>>> Sqoop is parallelized and works with ORC and probably other formats as 
>>> well. It is using JDBC for each connection between data-nodes and their AMP 
>>> (compute) nodes. There is an additional layer that coordinates all of it.
>>> I know Oracle has a similar technology I've used it and had to supply the 
>>> JDBC driver.
>>> 
>>> Teradata Connector is for batch data copy, QueryGrid is for interactive 
>>> data movement.
>>> 
>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com 
>>> <mailto:java8...@hotmail.com>> wrote:
>>> If they do that, they must provide a customized input format, instead of 
>>> through JDBC.
>>> 
>>> Yong
>>> 
>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>> Subject: Re: Sqoop on Spark
>>> From: mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>
>>> To: mohaj...@gmail.com <mailto:mohaj...@gmail.com>
>>> CC: jornfra...@gmail.com <mailto:jornfra...@gmail.com>; 
>>> msegel_had...@hotmail.com <mailto:msegel_had...@hotmail.com>; 

Re: Sqoop on Spark

2016-04-10 Thread Jörn Franke
h the above studies were done in a system of same capacity, with 32 GB 
>>>> RAM and dual hexacore Xeon processors and SSD. SPARK was running locally, 
>>>> and SQOOP ran on HADOOP2 and extracted data to local file system)
>>>> 
>>>> In case any one needs to know what needs to be done to access both the CSV 
>>>> and JDBC modules in SPARK Local Server mode, please let me know.
>>>> 
>>>> 
>>>> Regards,
>>>> Gourav Sengupta
>>>> 
>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> Good to know that.
>>>>> 
>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor specific 
>>>>> feature.
>>>>> 
>>>>> But for MPP, I still think it makes sense that vendor provide some kind 
>>>>> of InputFormat, or data source in Spark, so Hadoop eco-system can 
>>>>> integrate with them more natively.
>>>>> 
>>>>> Yong
>>>>> 
>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>> Subject: Re: Sqoop on Spark
>>>>> From: mohaj...@gmail.com
>>>>> To: java8...@hotmail.com
>>>>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com; 
>>>>> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com; 
>>>>> user@spark.apache.org
>>>>> 
>>>>> 
>>>>> It is using JDBC driver, i know that's the case for Teradata:
>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>> 
>>>>> Teradata Connector (which is used by Cloudera and Hortonworks) for doing 
>>>>> Sqoop is parallelized and works with ORC and probably other formats as 
>>>>> well. It is using JDBC for each connection between data-nodes and their 
>>>>> AMP (compute) nodes. There is an additional layer that coordinates all of 
>>>>> it.
>>>>> I know Oracle has a similar technology I've used it and had to supply the 
>>>>> JDBC driver.
>>>>> 
>>>>> Teradata Connector is for batch data copy, QueryGrid is for interactive 
>>>>> data movement.
>>>>> 
>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>>>> If they do that, they must provide a customized input format, instead of 
>>>>> through JDBC.
>>>>> 
>>>>> Yong
>>>>> 
>>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>>> Subject: Re: Sqoop on Spark
>>>>> From: mich.talebza...@gmail.com
>>>>> To: mohaj...@gmail.com
>>>>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; 
>>>>> linguin@gmail.com; user@spark.apache.org
>>>>> 
>>>>> 
>>>>> SAP Sybase IQ does that and I believe SAP Hana as well.
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>> 
>>>>>  
>>>>> 
>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>>>>> For some MPP relational stores (not operational) it maybe feasible to run 
>>>>> Spark jobs and also have data locality. I know QueryGrid (Teradata) and 
>>>>> PolyBase (microsoft) use data locality to move data between their MPP and 
>>>>> Hadoop. 
>>>>> I would guess (have no idea) someone like IBM already is doing that for 
>>>>> Spark, maybe a bit off topic!
>>>>> 
>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Well I am not sure, but using a database as a storage, such as relational 
>>>>> databases or certain nosql databases (eg MongoDB) for Spark is generally 
>>>>> a bad idea - no data locality, it cannot handle real big data volumes for 
>>>>> compute and you may potentially overload an operational database. 
>>>>> And if your job fails for whatever reason (eg scheduling ) then you have 
>>>>> to pull everything out again. Sqoop and HDFS seems to me the more elegant 
>>>>> solution toget

Re: Sqoop on Spark

2016-04-10 Thread Mich Talebzadeh
Yes I meant MR.

Again, one cannot beat the RDBMS's own export utility. I was specifically
referring to Oracle in the above case, which does not provide any specific
text-based export, only binary ones (exp, Data Pump etc.).

In the case of SAP ASE, Sybase IQ and MSSQL, one can use BCP (bulk copy), which
can be parallelised either through range partitioning or simple round-robin
partitioning to get the data out to files in parallel. Then one can get the
data into a Hive table through an import etc.

In general, if the source table is very large, you can use either SAP
Replication Server (SRS) or Oracle GoldenGate to get the data to Hive. Both of
these replication tools provide connectors to Hive and they do a good job.
If one has something like Oracle in prod then GoldenGate is likely already
there. For bulk setting up of Hive tables and data migration, a replication
server is a good option.

HTH
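
To make the "get the data into a Hive table" step concrete, here is a minimal Spark SQL sketch of loading such bulk-exported, pipe-delimited files into Hive via an external table. This is only an illustration for spark-shell (Spark 2.x with Hive support); the database, table and column names, the '|' delimiter and the /staging path are assumptions, not anything taken from the thread, and the target table is assumed to already exist.

// Expose the exported files through an external Hive table, then rewrite them
// into the real (e.g. ORC) target table in one parallel pass.
spark.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS staging.mytable_ext " +
  "(id STRING, col1 STRING, col2 STRING) " +
  "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' " +
  "STORED AS TEXTFILE LOCATION '/staging/mytable'")

spark.sql(
  "INSERT OVERWRITE TABLE warehouse.mytable " +
  "SELECT id, col1, col2 FROM staging.mytable_ext")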


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 10 April 2016 at 14:24, Michael Segel <msegel_had...@hotmail.com> wrote:

> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce)
>
> The largest problem with sqoop is that in order to gain parallelism you
> need to know how your underlying table is partitioned and to do multiple
> range queries. This may not be known, or your data may or may not be
> equally distributed across the ranges.
>
> If you’re bringing over the entire table, you may find dropping it and
> then moving it to HDFS and then doing a bulk load to be more efficient.
> (This is less flexible than sqoop, but also stresses the database servers
> less. )
>
> Again, YMMV
>
>
> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> Well unless you have plenty of memory, you are going to have certain
> issues with Spark.
>
> I tried to load a billion rows table from oracle through spark using JDBC
> and ended up with "Caused by: java.lang.OutOfMemoryError: Java heap space"
> error.
>
> Sqoop uses MapR and does it in serial mode which takes time and you can
> also tell it to create Hive table. However, it will import data into Hive
> table.
>
> In any case the mechanism of data import is through JDBC, Spark uses
> memory and DAG, whereas Sqoop relies on MapR.
>
> There is of course another alternative.
>
> Assuming that your Oracle table has a primary Key say "ID" (it would be
> easier if it was a monotonically increasing number) or already partitioned.
>
>
>1. You can create views based on the range of ID or for each
>partition. You can then SELECT COLUMNS  co1, col2, coln from view and spool
>it to a text file on OS (locally say backup directory would be fastest).
>2. bzip2 those files and scp them to a local directory in Hadoop
>3. You can then use Spark/hive to load the target table from local
>files in parallel
>4. When creating views take care of NUMBER and CHAR columns in Oracle
>and convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS
>VARCHAR2(n)) AS coln etc
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Some metrics thrown around the discussion:
>>
>> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21
>> GB)
>> SPARK: load the data into memory (15 mins)
>>
>> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load
>> 500 million records - manually killed after 8 hours.
>>
>> (both the above studies were done in a system of same capacity, with 32
>> GB RAM and dual hexacore Xeon processors and SSD. SPARK was running
>> locally, and SQOOP ran on HADOOP2 and extracted data to local file system)
>>
>> In case any one needs to know what needs to be done to access both the
>> CSV and JDBC modules in SPARK Local Server mode, please let me know.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>
>>> Good to know that.
>>>
>>> That is why Sqoop has this "direct" mode, to utilize the vendor specific
>>> feature.
>>>
>>> But for MPP, I still think it makes 

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce) 

The largest problem with sqoop is that in order to gain parallelism you need to 
know how your underlying table is partitioned and to do multiple range queries. 
This may not be known, or your data may or may not be equally distributed 
across the ranges.  

If you’re bringing over the entire table, you may find dropping it and then 
moving it to HDFS and then doing a bulk load to be more efficient.
(This is less flexible than sqoop, but also stresses the database servers less. 
) 

Again, YMMV
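
As a rough sketch of working around that from the Spark side: the JDBC reader can be handed one explicit predicate per partition, so each partition becomes its own range query even when the ranges are deliberately unequal to match a skewed distribution. This is illustrative spark-shell (Spark 2.x) code only; the connection URL, credentials, table name and ID ranges are assumptions.

import java.util.Properties

val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"      // assumed
val props = new Properties()
props.setProperty("user", "scott")                    // assumed
props.setProperty("password", "tiger")                // assumed

// One partition (and one concurrent connection) per predicate; widen the
// ranges where the data is known to be sparse.
val predicates = Array(
  "ID >=         1 AND ID < 100000000",
  "ID >= 100000000 AND ID < 250000000",
  "ID >= 250000000")

val df = spark.read.jdbc(url, "MYSCHEMA.MYTABLE", predicates, props)
df.rdd.getNumPartitions   // == predicates.length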


> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Well unless you have plenty of memory, you are going to have certain issues 
> with Spark.
> 
> I tried to load a billion rows table from oracle through spark using JDBC and 
> ended up with "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
> 
> Sqoop uses MapR and does it in serial mode which takes time and you can also 
> tell it to create Hive table. However, it will import data into Hive table.
> 
> In any case the mechanism of data import is through JDBC, Spark uses memory 
> and DAG, whereas Sqoop relies on MapR.
> 
> There is of course another alternative.
> 
> Assuming that your Oracle table has a primary Key say "ID" (it would be 
> easier if it was a monotonically increasing number) or already partitioned.
> 
> You can create views based on the range of ID or for each partition. You can 
> then SELECT COLUMNS  co1, col2, coln from view and spool it to a text file on 
> OS (locally say backup directory would be fastest).
> bzip2 those files and scp them to a local directory in Hadoop
> You can then use Spark/hive to load the target table from local files in 
> parallel
> When creating views take care of NUMBER and CHAR columns in Oracle and 
> convert them to TO_CHAR(NUMBER_COLUMN) and varchar CAST(coln AS VARCHAR2(n)) 
> AS coln etc 
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com 
> <mailto:gourav.sengu...@gmail.com>> wrote:
> Hi,
> 
> Some metrics thrown around the discussion:
> 
> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21 GB)
> SPARK: load the data into memory (15 mins)
> 
> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load 500 
> million records - manually killed after 8 hours.
> 
> (both the above studies were done in a system of same capacity, with 32 GB 
> RAM and dual hexacore Xeon processors and SSD. SPARK was running locally, and 
> SQOOP ran on HADOOP2 and extracted data to local file system)
> 
> In case any one needs to know what needs to be done to access both the CSV 
> and JDBC modules in SPARK Local Server mode, please let me know.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com 
> <mailto:java8...@hotmail.com>> wrote:
> Good to know that.
> 
> That is why Sqoop has this "direct" mode, to utilize the vendor specific 
> feature.
> 
> But for MPP, I still think it makes sense that vendor provide some kind of 
> InputFormat, or data source in Spark, so Hadoop eco-system can integrate with 
> them more natively.
> 
> Yong
> 
> Date: Wed, 6 Apr 2016 16:12:30 -0700
> Subject: Re: Sqoop on Spark
> From: mohaj...@gmail.com <mailto:mohaj...@gmail.com>
> To: java8...@hotmail.com <mailto:java8...@hotmail.com>
> CC: mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>; 
> jornfra...@gmail.com <mailto:jornfra...@gmail.com>; msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>; guha.a...@gmail.com 
> <mailto:guha.a...@gmail.com>; linguin@gmail.com 
> <mailto:linguin@gmail.com>; user@spark.apache.org 
> <mailto:user@spark.apache.org>
> 
> 
> It is using JDBC driver, i know that's the case for Teradata:
> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>  
> <http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available>
> 
> Teradata Connector (which is used by Cloudera and Hortonworks) for doing 
> Sqoop is parallelized and works with ORC and probably other formats as well. 
> It is using JDBC for each connection between data-nodes and their AMP 
> (compute) nodes. 

Re: Sqoop on Spark

2016-04-08 Thread Mich Talebzadeh
Well unless you have plenty of memory, you are going to have certain issues
with Spark.

I tried to load a billion-row table from Oracle through Spark using JDBC
and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space"
error.

Sqoop uses MapR and does it in serial mode which takes time and you can
also tell it to create Hive table. However, it will import data into Hive
table.

In any case the mechanism of data import is through JDBC, Spark uses memory
and DAG, whereas Sqoop relies on MapR.

There is of course another alternative.

Assuming that your Oracle table has a primary key, say "ID" (it would be
easier if it was a monotonically increasing number), or is already partitioned:


   1. You can create views based on the range of ID, or one view per partition.
   You can then SELECT the columns col1, col2, coln from the view and spool
   them to a text file on the OS (locally, say a backup directory, would be
   fastest).
   2. bzip2 those files and scp them to a local directory on a Hadoop node.
   3. You can then use Spark/Hive to load the target table from the local
   files in parallel (a rough sketch follows below).
   4. When creating the views take care of NUMBER and CHAR columns in Oracle
   and convert them, e.g. TO_CHAR(number_column) and CAST(coln AS
   VARCHAR2(n)) AS coln.


HTH
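
As a rough sketch of steps 1-4 above in spark-shell (Spark 2.x with Hive support): the Oracle-side view DDL, the '|' spool delimiter, the landing directory and the table names below are illustrative assumptions only. bzip2 files are splittable, so both the read and the write into Hive run in parallel.

// Oracle side (one view per ID range or partition), roughly:
//   CREATE VIEW mytable_v1 AS
//   SELECT TO_CHAR(id) AS id, col1, CAST(col2 AS VARCHAR2(100)) AS col2
//   FROM mytable WHERE id BETWEEN 1 AND 100000000;
// then spool each view to a '|'-delimited file, bzip2 it, scp it to a Hadoop
// edge node and hdfs dfs -put it under /staging/mytable.

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   StringType),    // everything arrives as text from the spool
  StructField("col1", StringType),
  StructField("col2", StringType)))

val df = spark.read
  .schema(schema)
  .option("sep", "|")                 // assumed spool delimiter
  .csv("/staging/mytable/*.bz2")

df.write.mode("overwrite").saveAsTable("warehouse.mytable")   // target Hive table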



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> Some metrics thrown around the discussion:
>
> SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21
> GB)
> SPARK: load the data into memory (15 mins)
>
> SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load
> 500 million records - manually killed after 8 hours.
>
> (both the above studies were done in a system of same capacity, with 32 GB
> RAM and dual hexacore Xeon processors and SSD. SPARK was running locally,
> and SQOOP ran on HADOOP2 and extracted data to local file system)
>
> In case any one needs to know what needs to be done to access both the CSV
> and JDBC modules in SPARK Local Server mode, please let me know.
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:
>
>> Good to know that.
>>
>> That is why Sqoop has this "direct" mode, to utilize the vendor specific
>> feature.
>>
>> But for MPP, I still think it makes sense that vendor provide some kind
>> of InputFormat, or data source in Spark, so Hadoop eco-system can integrate
>> with them more natively.
>>
>> Yong
>>
>> --
>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>> Subject: Re: Sqoop on Spark
>> From: mohaj...@gmail.com
>> To: java8...@hotmail.com
>> CC: mich.talebza...@gmail.com; jornfra...@gmail.com;
>> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com;
>> user@spark.apache.org
>>
>>
>> It is using JDBC driver, i know that's the case for Teradata:
>>
>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>
>> Teradata Connector (which is used by Cloudera and Hortonworks) for doing
>> Sqoop is parallelized and works with ORC and probably other formats as
>> well. It is using JDBC for each connection between data-nodes and their AMP
>> (compute) nodes. There is an additional layer that coordinates all of it.
>> I know Oracle has a similar technology I've used it and had to supply the
>> JDBC driver.
>>
>> Teradata Connector is for batch data copy, QueryGrid is for interactive
>> data movement.
>>
>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>>
>> If they do that, they must provide a customized input format, instead of
>> through JDBC.
>>
>> Yong
>>
>> --
>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>> Subject: Re: Sqoop on Spark
>> From: mich.talebza...@gmail.com
>> To: mohaj...@gmail.com
>> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;
>> linguin@gmail.com; user@spark.apache.org
>>
>>
>> SAP Sybase IQ does that and I believe SAP Hana as well.
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>

Re: Sqoop on Spark

2016-04-08 Thread Gourav Sengupta
Hi,

Some metrics thrown around the discussion:

SQOOP: extract 500 million rows (in single thread) 20 mins (data size 21 GB)
SPARK: load the data into memory (15 mins)

SPARK: use JDBC (and similar to SQOOP difficult parallelization) to load
500 million records - manually killed after 8 hours.

(both the above studies were done in a system of same capacity, with 32 GB
RAM and dual hexacore Xeon processors and SSD. SPARK was running locally,
and SQOOP ran on HADOOP2 and extracted data to local file system)

In case any one needs to know what needs to be done to access both the CSV
and JDBC modules in SPARK Local Server mode, please let me know.


Regards,
Gourav Sengupta
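
For reference, a rough spark-shell sketch of a partitioned JDBC read (Spark 2.x, where the CSV and JDBC sources are built in; on 1.x the CSV reader came from the external spark-csv package). The URL, credentials, table and ID bounds are illustrative assumptions; the point is that without a partition column and bounds the JDBC source pulls everything over a single connection, which is likely the behaviour behind runs like the 8-hour one above.

val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // assumed
  .option("dbtable", "MYSCHEMA.MYTABLE")                   // assumed
  .option("user", "scott")                                 // assumed
  .option("password", "tiger")                             // assumed
  .option("partitionColumn", "ID")        // numeric, ideally indexed
  .option("lowerBound", "1")
  .option("upperBound", "500000000")
  .option("numPartitions", "16")          // 16 concurrent range queries
  .load()

df.write.option("compression", "bzip2").csv("/tmp/mytable_csv")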

On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8...@hotmail.com> wrote:

> Good to know that.
>
> That is why Sqoop has this "direct" mode, to utilize the vendor specific
> feature.
>
> But for MPP, I still think it makes sense that vendor provide some kind of
> InputFormat, or data source in Spark, so Hadoop eco-system can integrate
> with them more natively.
>
> Yong
>
> --
> Date: Wed, 6 Apr 2016 16:12:30 -0700
> Subject: Re: Sqoop on Spark
> From: mohaj...@gmail.com
> To: java8...@hotmail.com
> CC: mich.talebza...@gmail.com; jornfra...@gmail.com;
> msegel_had...@hotmail.com; guha.a...@gmail.com; linguin@gmail.com;
> user@spark.apache.org
>
>
> It is using JDBC driver, i know that's the case for Teradata:
>
> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>
> Teradata Connector (which is used by Cloudera and Hortonworks) for doing
> Sqoop is parallelized and works with ORC and probably other formats as
> well. It is using JDBC for each connection between data-nodes and their AMP
> (compute) nodes. There is an additional layer that coordinates all of it.
> I know Oracle has a similar technology I've used it and had to supply the
> JDBC driver.
>
> Teradata Connector is for batch data copy, QueryGrid is for interactive
> data movement.
>
> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:
>
> If they do that, they must provide a customized input format, instead of
> through JDBC.
>
> Yong
>
> --
> Date: Wed, 6 Apr 2016 23:56:54 +0100
> Subject: Re: Sqoop on Spark
> From: mich.talebza...@gmail.com
> To: mohaj...@gmail.com
> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;
> linguin@gmail.com; user@spark.apache.org
>
>
> SAP Sybase IQ does that and I believe SAP Hana as well.
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>
> For some MPP relational stores (not operational) it maybe feasible to run
> Spark jobs and also have data locality. I know QueryGrid (Teradata) and
> PolyBase (microsoft) use data locality to move data between their MPP and
> Hadoop.
> I would guess (have no idea) someone like IBM already is doing that for
> Spark, maybe a bit off topic!
>
> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Well I am not sure, but using a database as a storage, such as relational
> databases or certain nosql databases (eg MongoDB) for Spark is generally a
> bad idea - no data locality, it cannot handle real big data volumes for
> compute and you may potentially overload an operational database.
> And if your job fails for whatever reason (eg scheduling ) then you have
> to pull everything out again. Sqoop and HDFS seems to me the more elegant
> solution together with spark. These "assumption" on parallelism have to be
> anyway made with any solution.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support anyway many different tools
> otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink thei

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
Good to know that.
That is why Sqoop has this "direct" mode, to utilize the vendor specific 
feature.
But for MPP, I still think it makes sense that vendor provide some kind of 
InputFormat, or data source in Spark, so Hadoop eco-system can integrate with 
them more natively.
Yong
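
As a purely hypothetical illustration of what such a vendor data source would look like from the Spark side (the format name and options below are invented for this example, not a real package):

// spark-shell (Spark 2.x); the vendor's connector would decide the splits,
// ideally one per AMP/segment, instead of generic JDBC range queries.
val df = spark.read
  .format("com.example.mpp")            // hypothetical vendor data source
  .option("host", "mpp-coordinator")    // hypothetical options
  .option("table", "sales")
  .load()

df.show(10)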

Date: Wed, 6 Apr 2016 16:12:30 -0700
Subject: Re: Sqoop on Spark
From: mohaj...@gmail.com
To: java8...@hotmail.com
CC: mich.talebza...@gmail.com; jornfra...@gmail.com; msegel_had...@hotmail.com; 
guha.a...@gmail.com; linguin@gmail.com; user@spark.apache.org

It is using JDBC driver, i know that's the case for Teradata:
http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available

Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop
is parallelized and works with ORC and probably other formats as well. It is
using JDBC for each connection between data-nodes and their AMP (compute)
nodes. There is an additional layer that coordinates all of it.
I know Oracle has a similar technology I've used it and had to supply the JDBC driver.
Teradata Connector is for batch data copy, QueryGrid is for interactive data 
movement.
On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:



If they do that, they must provide a customized input format, instead of 
through JDBC.
Yong

Date: Wed, 6 Apr 2016 23:56:54 +0100
Subject: Re: Sqoop on Spark
From: mich.talebza...@gmail.com
To: mohaj...@gmail.com
CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; 
linguin@gmail.com; user@spark.apache.org

SAP Sybase IQ does that and I believe SAP Hana as well.

Dr Mich Talebzadeh


 


LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw


 


http://talebzadehmich.wordpress.com

 




On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
For some MPP relational stores (not operational) it maybe feasible to run Spark 
jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase 
(microsoft) use data locality to move data between their MPP and Hadoop. I 
would guess (have no idea) someone like IBM already is doing that for Spark, 
maybe a bit off topic!
On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Well I am not sure, but using a database as a storage, such as relational 
databases or certain nosql databases (eg MongoDB) for Spark is generally a bad 
idea - no data locality, it cannot handle real big data volumes for compute and 
you may potentially overload an operational database. And if your job fails for 
whatever reason (eg scheduling ) then you have to pull everything out again. 
Sqoop and HDFS seems to me the more elegant solution together with spark. These 
"assumption" on parallelism have to be anyway made with any solution.Of course 
you can always redo things, but why - what benefit do you expect? A real big 
data platform has to support anyway many different tools otherwise people doing 
analytics will be limited. 
On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:

I don’t think its necessarily a bad idea.
Sqoop is an ugly tool and it requires you to make some assumptions as a way to 
gain parallelism. (Not that most of the assumptions are not valid for most of 
the use cases…) 
Depending on what you want to do… your data may not be persisted on HDFS.  
There are use cases where your cluster is used for compute and not storage.
I’d say that spending time re-inventing the wheel can be a good thing. It would 
be a good idea for many to rethink their ingestion process so that they can 
have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean 
Wampler. ;-) 
Just saying. ;-) 
-Mike
On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
I do not think you can be more resource efficient. In the end you have to store 
the data anyway on HDFS . You have a lot of development effort for doing 
something like sqoop. Especially with error handling. You may create a ticket 
with the Sqoop guys to support Spark as an execution engine and maybe it is 
less effort to plug it in there.
Maybe if your cluster is loaded then you may
want to add more machines or improve the existing programs.
On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:

One of the reason in my mind is to avoid Map-Reduce application completely 
during ingestion, if possible. Also, I can then use Spark stand alone cluster 
to ingest, even if my hadoop cluster is heavily loaded. What you guys think?
On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Why do you want to reimplement something which is already there?
On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:

Hi
Thanks for reply. My use case is query ~40 tables from Oracle (using index and 
incremental only) and add data to existing Hive tables. Also, it 

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
It is using the JDBC driver; I know that's the case for Teradata:
http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available

Teradata Connector (which is used by Cloudera and Hortonworks) for doing
Sqoop is parallelized and works with ORC and probably other formats as
well. It is using JDBC for each connection between data-nodes and their AMP
(compute) nodes. There is an additional layer that coordinates all of it.
I know Oracle has a similar technology; I've used it and had to supply the
JDBC driver.

Teradata Connector is for batch data copy, QueryGrid is for interactive
data movement.

On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote:

> If they do that, they must provide a customized input format, instead of
> through JDBC.
>
> Yong
>
> --
> Date: Wed, 6 Apr 2016 23:56:54 +0100
> Subject: Re: Sqoop on Spark
> From: mich.talebza...@gmail.com
> To: mohaj...@gmail.com
> CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;
> linguin@gmail.com; user@spark.apache.org
>
>
> SAP Sybase IQ does that and I believe SAP Hana as well.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
>
> For some MPP relational stores (not operational) it maybe feasible to run
> Spark jobs and also have data locality. I know QueryGrid (Teradata) and
> PolyBase (microsoft) use data locality to move data between their MPP and
> Hadoop.
> I would guess (have no idea) someone like IBM already is doing that for
> Spark, maybe a bit off topic!
>
> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Well I am not sure, but using a database as a storage, such as relational
> databases or certain nosql databases (eg MongoDB) for Spark is generally a
> bad idea - no data locality, it cannot handle real big data volumes for
> compute and you may potentially overload an operational database.
> And if your job fails for whatever reason (eg scheduling ) then you have
> to pull everything out again. Sqoop and HDFS seems to me the more elegant
> solution together with spark. These "assumption" on parallelism have to be
> anyway made with any solution.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support anyway many different tools
> otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:
>
> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data anyway on HDFS . You have a lot of development effort for
> doing something like sqoop. Especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>
> One of the reason in my mind is to avoid Map-Reduce application completely
> during ingestion, if possible. Also, I can then use Spark stand alone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> Why do you want to reimplement something which is already there?
>
> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>
> Hi
>
> Thanks for reply. My use case is query ~40 tables from Oracle (using index
> and incremental only) and add data to existing Hive tables. Also

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized input format, instead of 
through JDBC.
Yong

Date: Wed, 6 Apr 2016 23:56:54 +0100
Subject: Re: Sqoop on Spark
From: mich.talebza...@gmail.com
To: mohaj...@gmail.com
CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com; 
linguin@gmail.com; user@spark.apache.org

SAP Sybase IQ does that and I believe SAP Hana as well.

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 



On 6 April 2016 at 23:49, Peyman Mohajerian <mohaj...@gmail.com> wrote:
For some MPP relational stores (not operational) it maybe feasible to run Spark 
jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase 
(microsoft) use data locality to move data between their MPP and Hadoop. I 
would guess (have no idea) someone like IBM already is doing that for Spark, 
maybe a bit off topic!
On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Well I am not sure, but using a database as a storage, such as relational 
databases or certain nosql databases (eg MongoDB) for Spark is generally a bad 
idea - no data locality, it cannot handle real big data volumes for compute and 
you may potentially overload an operational database. And if your job fails for 
whatever reason (eg scheduling ) then you have to pull everything out again. 
Sqoop and HDFS seems to me the more elegant solution together with spark. These 
"assumption" on parallelism have to be anyway made with any solution.Of course 
you can always redo things, but why - what benefit do you expect? A real big 
data platform has to support anyway many different tools otherwise people doing 
analytics will be limited. 
On 06 Apr 2016, at 20:05, Michael Segel <msegel_had...@hotmail.com> wrote:

I don’t think its necessarily a bad idea.
Sqoop is an ugly tool and it requires you to make some assumptions as a way to 
gain parallelism. (Not that most of the assumptions are not valid for most of 
the use cases…) 
Depending on what you want to do… your data may not be persisted on HDFS.  
There are use cases where your cluster is used for compute and not storage.
I’d say that spending time re-inventing the wheel can be a good thing. It would 
be a good idea for many to rethink their ingestion process so that they can 
have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean 
Wampler. ;-) 
Just saying. ;-) 
-Mike
On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
I do not think you can be more resource efficient. In the end you have to store 
the data anyway on HDFS . You have a lot of development effort for doing 
something like sqoop. Especially with error handling. You may create a ticket 
with the Sqoop guys to support Spark as an execution engine and maybe it is 
less effort to plug it in there.
Maybe if your cluster is loaded then you may
want to add more machines or improve the existing programs.
On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:

One of the reason in my mind is to avoid Map-Reduce application completely 
during ingestion, if possible. Also, I can then use Spark stand alone cluster 
to ingest, even if my hadoop cluster is heavily loaded. What you guys think?
On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
Why do you want to reimplement something which is already there?
On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:

Hi
Thanks for reply. My use case is query ~40 tables from Oracle (using index and 
incremental only) and add data to existing Hive tables. Also, it would be good 
to have an option to create Hive table, driven by job specific configuration. 
What do you think?
Best
Ayan
On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin@gmail.com> wrote:
Hi,
It depends on your use case using sqoop.
What's it like?
// maropu
On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
Hi All
Asking opinion: is it possible/advisable to use spark to replace what sqoop 
does? Any existing project done in similar lines?
-- 
Best Regards,
Ayan Guha




-- 
---
Takeshi Yamamuro




-- 
Best Regards,
Ayan Guha




-- 
Best Regards,
Ayan Guha






  

Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
SAP Sybase IQ does that and I believe SAP Hana as well.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 6 April 2016 at 23:49, Peyman Mohajerian  wrote:

> For some MPP relational stores (not operational) it maybe feasible to run
> Spark jobs and also have data locality. I know QueryGrid (Teradata) and
> PolyBase (microsoft) use data locality to move data between their MPP and
> Hadoop.
> I would guess (have no idea) someone like IBM already is doing that for
> Spark, maybe a bit off topic!
>
> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke  wrote:
>
>> Well I am not sure, but using a database as a storage, such as relational
>> databases or certain nosql databases (eg MongoDB) for Spark is generally a
>> bad idea - no data locality, it cannot handle real big data volumes for
>> compute and you may potentially overload an operational database.
>> And if your job fails for whatever reason (eg scheduling ) then you have
>> to pull everything out again. Sqoop and HDFS seems to me the more elegant
>> solution together with spark. These "assumption" on parallelism have to be
>> anyway made with any solution.
>> Of course you can always redo things, but why - what benefit do you
>> expect? A real big data platform has to support anyway many different tools
>> otherwise people doing analytics will be limited.
>>
>> On 06 Apr 2016, at 20:05, Michael Segel 
>> wrote:
>>
>> I don’t think its necessarily a bad idea.
>>
>> Sqoop is an ugly tool and it requires you to make some assumptions as a
>> way to gain parallelism. (Not that most of the assumptions are not valid
>> for most of the use cases…)
>>
>> Depending on what you want to do… your data may not be persisted on
>> HDFS.  There are use cases where your cluster is used for compute and not
>> storage.
>>
>> I’d say that spending time re-inventing the wheel can be a good thing.
>> It would be a good idea for many to rethink their ingestion process so
>> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
>> that term from Dean Wampler. ;-)
>>
>> Just saying. ;-)
>>
>> -Mike
>>
>> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
>>
>> I do not think you can be more resource efficient. In the end you have to
>> store the data anyway on HDFS . You have a lot of development effort for
>> doing something like sqoop. Especially with error handling.
>> You may create a ticket with the Sqoop guys to support Spark as an
>> execution engine and maybe it is less effort to plug it in there.
>> Maybe if your cluster is loaded then you may want to add more machines or
>> improve the existing programs.
>>
>> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>>
>> One of the reason in my mind is to avoid Map-Reduce application
>> completely during ingestion, if possible. Also, I can then use Spark stand
>> alone cluster to ingest, even if my hadoop cluster is heavily loaded. What
>> you guys think?
>>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>>
>>> Why do you want to reimplement something which is already there?
>>>
>>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>>
>>> Hi
>>>
>>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>>> index and incremental only) and add data to existing Hive tables. Also, it
>>> would be good to have an option to create Hive table, driven by job
>>> specific configuration.
>>>
>>> What do you think?
>>>
>>> Best
>>> Ayan
>>>
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>>> wrote:
>>>
 Hi,

 It depends on your use case using sqoop.
 What's it like?

 // maropu

 On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:

> Hi All
>
> Asking opinion: is it possible/advisable to use spark to replace what
> sqoop does? Any existing project done in similar lines?
>
> --
> Best Regards,
> Ayan Guha
>



 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>
>


Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
For some MPP relational stores (not operational) it may be feasible to run
Spark jobs and also have data locality. I know QueryGrid (Teradata) and
PolyBase (Microsoft) use data locality to move data between their MPP and
Hadoop.
I would guess (have no idea) someone like IBM already is doing that for
Spark, maybe a bit off topic!

On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke  wrote:

> Well I am not sure, but using a database as a storage, such as relational
> databases or certain nosql databases (eg MongoDB) for Spark is generally a
> bad idea - no data locality, it cannot handle real big data volumes for
> compute and you may potentially overload an operational database.
> And if your job fails for whatever reason (eg scheduling ) then you have
> to pull everything out again. Sqoop and HDFS seems to me the more elegant
> solution together with spark. These "assumption" on parallelism have to be
> anyway made with any solution.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support anyway many different tools
> otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel  wrote:
>
> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data anyway on HDFS . You have a lot of development effort for
> doing something like sqoop. Especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>
> One of the reason in my mind is to avoid Map-Reduce application completely
> during ingestion, if possible. Also, I can then use Spark stand alone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>
>> Hi
>>
>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>> index and incremental only) and add data to existing Hive tables. Also, it
>> would be good to have an option to create Hive table, driven by job
>> specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:
>>>
 Hi All

 Asking opinion: is it possible/advisable to use spark to replace what
 sqoop does? Any existing project done in similar lines?

 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Sorry, are you referring to Hive as a relational data warehouse in this
scenario? The assumption here is that the data is coming from a relational
database (Oracle), so IMO the best storage for it in the Big Data world is
another DW adapted to SQL. Spark is a powerful query tool and, together
with Hive as a backbone of storage, provides a powerful framework for almost
anything. The performance is pretty fast indeed, much faster compared to
MapR that Sqoop uses by default.

Anyway, you are not confined to a table in Hive. You can take that data from
JDBC and do whatever is needed. There is no constraint here.
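
As a small spark-shell (Spark 2.x) sketch of that point, with connection details and column names as illustrative assumptions only: the JDBC result is just a DataFrame, so it can be aggregated and written straight to Parquet on HDFS without ever touching a Hive table.

val orders = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // assumed
  .option("dbtable", "MYSCHEMA.ORDERS")                    // assumed
  .option("user", "scott")                                 // assumed
  .option("password", "tiger")                             // assumed
  .load()

orders.groupBy("CUSTOMER_ID")
  .count()
  .write.mode("overwrite")
  .parquet("hdfs:///analytics/orders_by_customer")         // no Hive table involved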

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 6 April 2016 at 23:29, Jörn Franke  wrote:

> Well I am not sure, but using a database as a storage, such as relational
> databases or certain nosql databases (eg MongoDB) for Spark is generally a
> bad idea - no data locality, it cannot handle real big data volumes for
> compute and you may potentially overload an operational database.
> And if your job fails for whatever reason (eg scheduling ) then you have
> to pull everything out again. Sqoop and HDFS seems to me the more elegant
> solution together with spark. These "assumption" on parallelism have to be
> anyway made with any solution.
> Of course you can always redo things, but why - what benefit do you
> expect? A real big data platform has to support anyway many different tools
> otherwise people doing analytics will be limited.
>
> On 06 Apr 2016, at 20:05, Michael Segel  wrote:
>
> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data anyway on HDFS . You have a lot of development effort for
> doing something like sqoop. Especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>
> One of the reason in my mind is to avoid Map-Reduce application completely
> during ingestion, if possible. Also, I can then use Spark stand alone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>
>> Hi
>>
>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>> index and incremental only) and add data to existing Hive tables. Also, it
>> would be good to have an option to create Hive table, driven by job
>> specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:
>>>
 Hi All

 Asking opinion: is it possible/advisable to use spark to replace what
 sqoop does? Any existing project done in similar lines?

 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Sqoop on Spark

2016-04-06 Thread Jörn Franke
Well, I am not sure, but using a database as storage for Spark, such as a relational 
database or certain NoSQL databases (e.g. MongoDB), is generally a bad idea - no data 
locality, it cannot handle really big data volumes for compute, and you may potentially 
overload an operational database.
And if your job fails for whatever reason (e.g. scheduling), then you have to pull 
everything out again. Sqoop and HDFS seem to me the more elegant solution together with 
Spark. These assumptions on parallelism have to be made with any solution anyway.
Of course you can always redo things, but why - what benefit do you expect? A real 
big data platform has to support many different tools anyway, otherwise people doing 
analytics will be limited.

> On 06 Apr 2016, at 20:05, Michael Segel  wrote:
> 
> I don’t think its necessarily a bad idea.
> 
> Sqoop is an ugly tool and it requires you to make some assumptions as a way 
> to gain parallelism. (Not that most of the assumptions are not valid for most 
> of the use cases…) 
> 
> Depending on what you want to do… your data may not be persisted on HDFS.  
> There are use cases where your cluster is used for compute and not storage.
> 
> I’d say that spending time re-inventing the wheel can be a good thing. 
> It would be a good idea for many to rethink their ingestion process so that 
> they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term 
> from Dean Wampler. ;-) 
> 
> Just saying. ;-) 
> 
> -Mike
> 
>> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
>> 
>> I do not think you can be more resource efficient. In the end you have to 
>> store the data anyway on HDFS . You have a lot of development effort for 
>> doing something like sqoop. Especially with error handling. 
>> You may create a ticket with the Sqoop guys to support Spark as an execution 
>> engine and maybe it is less effort to plug it in there.
>> Maybe if your cluster is loaded then you may want to add more machines or 
>> improve the existing programs.
>> 
>>> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>>> 
>>> One of the reason in my mind is to avoid Map-Reduce application completely 
>>> during ingestion, if possible. Also, I can then use Spark stand alone 
>>> cluster to ingest, even if my hadoop cluster is heavily loaded. What you 
>>> guys think?
>>> 
 On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
 Why do you want to reimplement something which is already there?
 
> On 06 Apr 2016, at 06:47, ayan guha  wrote:
> 
> Hi
> 
> Thanks for reply. My use case is query ~40 tables from Oracle (using 
> index and incremental only) and add data to existing Hive tables. Also, 
> it would be good to have an option to create Hive table, driven by job 
> specific configuration. 
> 
> What do you think?
> 
> Best
> Ayan
> 
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro  
>> wrote:
>> Hi,
>> 
>> It depends on your use case using sqoop.
>> What's it like?
>> 
>> // maropu
>> 
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:
>>> Hi All
>>> 
>>> Asking opinion: is it possible/advisable to use spark to replace what 
>>> sqoop does? Any existing project done in similar lines?
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>> 
>> 
>> 
>> -- 
>> ---
>> Takeshi Yamamuro
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
> 


Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
I just created an example of how to use Spark JDBC (rather than Sqoop) to get
Oracle data into a Hive table. Please see the thread below:

How to use Spark JDBC to read from RDBMS table, create Hive ORC table and
save RDBMS data in it
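
For readers who cannot see that thread, a condensed sketch of the approach
(the target database, table names and columns below are hypothetical, and a
HiveContext instance named HiveContext is assumed):

// 1. read the RDBMS table over JDBC into a DataFrame
val src = HiveContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
  "dbtable" -> "sh.sales",
  "user" -> "sh",
  "password" -> "***")).load()
src.registerTempTable("tmp_sales")

// 2. create the Hive ORC table
HiveContext.sql("CREATE TABLE IF NOT EXISTS test.sales_orc (prod_id INT, amount_sold DOUBLE) STORED AS ORC")

// 3. save the RDBMS data in it
HiveContext.sql("INSERT INTO TABLE test.sales_orc SELECT prod_id, amount_sold FROM tmp_sales")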

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 April 2016 at 22:41, Ranadip Chatterjee <ranadi...@gmail.com> wrote:

> I know of projects that have done this but have never seen any advantage
> of "using spark to do what sqoop does" - at least in a yarn cluster. Both
> frameworks will have similar overheads of getting the containers allocated
> by yarn and creating new jvms to do the work. Probably spark will have a
> slightly higher overhead due to creation of RDD before writing the data to
> hdfs - something that the sqoop mapper need not do. (So what am I
> overlooking here?)
>
> In cases where a data pipeline is being built with the sqooped data being
> the only trigger, there is a justification for using spark instead of sqoop
> to short circuit the data directly into the transformation pipeline.
>
> Regards
> Ranadip
> On 6 Apr 2016 7:05 p.m., "Michael Segel" <msegel_had...@hotmail.com>
> wrote:
>
>> I don’t think its necessarily a bad idea.
>>
>> Sqoop is an ugly tool and it requires you to make some assumptions as a
>> way to gain parallelism. (Not that most of the assumptions are not valid
>> for most of the use cases…)
>>
>> Depending on what you want to do… your data may not be persisted on
>> HDFS.  There are use cases where your cluster is used for compute and not
>> storage.
>>
>> I’d say that spending time re-inventing the wheel can be a good thing.
>> It would be a good idea for many to rethink their ingestion process so
>> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
>> that term from Dean Wampler. ;-)
>>
>> Just saying. ;-)
>>
>> -Mike
>>
>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>> I do not think you can be more resource efficient. In the end you have to
>> store the data anyway on HDFS . You have a lot of development effort for
>> doing something like sqoop. Especially with error handling.
>> You may create a ticket with the Sqoop guys to support Spark as an
>> execution engine and maybe it is less effort to plug it in there.
>> Maybe if your cluster is loaded then you may want to add more machines or
>> improve the existing programs.
>>
>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>
>> One of the reason in my mind is to avoid Map-Reduce application
>> completely during ingestion, if possible. Also, I can then use Spark stand
>> alone cluster to ingest, even if my hadoop cluster is heavily loaded. What
>> you guys think?
>>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Why do you want to reimplement something which is already there?
>>>
>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>>> index and incremental only) and add data to existing Hive tables. Also, it
>>> would be good to have an option to create Hive table, driven by job
>>> specific configuration.
>>>
>>> What do you think?
>>>
>>> Best
>>> Ayan
>>>
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It depends on your use case using sqoop.
>>>> What's it like?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> Asking opinion: is it possible/advisable to use spark to replace what
>>>>> sqoop does? Any existing project done in similar lines?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>


Re: Sqoop on Spark

2016-04-06 Thread Ranadip Chatterjee
I know of projects that have done this, but I have never seen any advantage in
"using Spark to do what Sqoop does" - at least in a YARN cluster. Both
frameworks will have similar overheads of getting the containers allocated
by YARN and creating new JVMs to do the work. Probably Spark will have a
slightly higher overhead due to the creation of an RDD before writing the data
to HDFS - something that the Sqoop mapper need not do. (So what am I
overlooking here?)

In cases where a data pipeline is being built with the sqooped data being
the only trigger, there is a justification for using Spark instead of Sqoop,
to short-circuit the data directly into the transformation pipeline.
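
As an illustrative sketch of that short-circuit (all names are hypothetical;
it assumes a spark-shell where sqlContext is a HiveContext):

import java.util.Properties

// Hypothetical sketch: feed the JDBC read straight into the transformation, so only
// the aggregated result is persisted - there is no raw landing step on HDFS.
val props = new Properties()
props.setProperty("user", "sh")
props.setProperty("password", "***")

val orders = sqlContext.read.jdbc("jdbc:oracle:thin:@rhes564:1521:mydb", "sh.orders", props)

orders.groupBy("customer_id")
      .agg(org.apache.spark.sql.functions.sum("amount").as("total_amount"))
      .write.mode("overwrite").saveAsTable("analytics.customer_totals")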

Regards
Ranadip
On 6 Apr 2016 7:05 p.m., "Michael Segel" <msegel_had...@hotmail.com> wrote:

> I don’t think its necessarily a bad idea.
>
> Sqoop is an ugly tool and it requires you to make some assumptions as a
> way to gain parallelism. (Not that most of the assumptions are not valid
> for most of the use cases…)
>
> Depending on what you want to do… your data may not be persisted on HDFS.
> There are use cases where your cluster is used for compute and not storage.
>
> I’d say that spending time re-inventing the wheel can be a good thing.
> It would be a good idea for many to rethink their ingestion process so
> that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing
> that term from Dean Wampler. ;-)
>
> Just saying. ;-)
>
> -Mike
>
> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
> I do not think you can be more resource efficient. In the end you have to
> store the data anyway on HDFS . You have a lot of development effort for
> doing something like sqoop. Especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>
> One of the reason in my mind is to avoid Map-Reduce application completely
> during ingestion, if possible. Also, I can then use Spark stand alone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>
>> Hi
>>
>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>> index and incremental only) and add data to existing Hive tables. Also, it
>> would be good to have an option to create Hive table, driven by job
>> specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> Asking opinion: is it possible/advisable to use spark to replace what
>>>> sqoop does? Any existing project done in similar lines?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think it’s necessarily a bad idea.

Sqoop is an ugly tool and it requires you to make some assumptions as a way to 
gain parallelism. (Not that most of the assumptions are not valid for most of 
the use cases…) 

Depending on what you want to do… your data may not be persisted on HDFS.  
There are use cases where your cluster is used for compute and not storage.

I’d say that spending time re-inventing the wheel can be a good thing. 
It would be a good idea for many to rethink their ingestion process so that 
they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term 
from Dean Wampler. ;-) 

Just saying. ;-) 

-Mike

> On Apr 5, 2016, at 10:44 PM, Jörn Franke  wrote:
> 
> I do not think you can be more resource efficient. In the end you have to 
> store the data anyway on HDFS . You have a lot of development effort for 
> doing something like sqoop. Especially with error handling. 
> You may create a ticket with the Sqoop guys to support Spark as an execution 
> engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or 
> improve the existing programs.
> 
> On 06 Apr 2016, at 07:33, ayan guha  > wrote:
> 
>> One of the reason in my mind is to avoid Map-Reduce application completely 
>> during ingestion, if possible. Also, I can then use Spark stand alone 
>> cluster to ingest, even if my hadoop cluster is heavily loaded. What you 
>> guys think?
>> 
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke > > wrote:
>> Why do you want to reimplement something which is already there?
>> 
>> On 06 Apr 2016, at 06:47, ayan guha > > wrote:
>> 
>>> Hi
>>> 
>>> Thanks for reply. My use case is query ~40 tables from Oracle (using index 
>>> and incremental only) and add data to existing Hive tables. Also, it would 
>>> be good to have an option to create Hive table, driven by job specific 
>>> configuration. 
>>> 
>>> What do you think?
>>> 
>>> Best
>>> Ayan
>>> 
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro >> > wrote:
>>> Hi,
>>> 
>>> It depends on your use case using sqoop.
>>> What's it like?
>>> 
>>> // maropu
>>> 
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha >> > wrote:
>>> Hi All
>>> 
>>> Asking opinion: is it possible/advisable to use spark to replace what sqoop 
>>> does? Any existing project done in similar lines?
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>>> 
>>> 
>>> 
>>> -- 
>>> ---
>>> Takeshi Yamamuro
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha



Re: Sqoop on Spark

2016-04-06 Thread Mich Talebzadeh
Yes, JDBC is another option. You need to be aware of some type conversion
issues (for example, the way Spark handles Oracle CHAR types). Your best bet
is to do the conversion when fetching the data from Oracle itself, as in the
query below.

// assumes a HiveContext instance, e.g. val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb"
var _username : String = "sh"
var _password : String = ""

// read from Oracle over JDBC; the to_char conversion is pushed down to Oracle
val c = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
      "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
      "user" -> _username,
      "password" -> _password))

// register the result so it can be referenced from SQL
c.registerTempTable("t_c")


Then put the data from the t_c temp table into the target Hive table.
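For example (assuming the target Hive table, hypothetically named test.channels
here, already exists with matching columns):

// insert the JDBC data registered as t_c into the Hive table
HiveContext.sql("INSERT INTO TABLE test.channels SELECT CHANNEL_ID, CHANNEL_DESC FROM t_c")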

HTH






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 6 April 2016 at 10:34, Jorge Sánchez <jorgesg1...@gmail.com> wrote:

> Ayan,
>
> there was a talk in spark summit
> https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/
> Apparently they had a lot of problems and the project seems abandoned.
>
> If you just have to do simple ingestion of a full table or a simple query,
> just use Sqoop as suggested by Mich, but if your use case requires further
> transformation of the data, I'd suggest you try Spark connecting to Oracle
> using JDBC and then having the data as a Dataframe.
>
> Regards.
>
> 2016-04-06 6:59 GMT+01:00 ayan guha <guha.a...@gmail.com>:
>
>> Thanks guys for feedback.
>>
>> On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> I do not think you can be more resource efficient. In the end you have
>>> to store the data anyway on HDFS . You have a lot of development effort for
>>> doing something like sqoop. Especially with error handling.
>>> You may create a ticket with the Sqoop guys to support Spark as an
>>> execution engine and maybe it is less effort to plug it in there.
>>> Maybe if your cluster is loaded then you may want to add more machines
>>> or improve the existing programs.
>>>
>>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> One of the reason in my mind is to avoid Map-Reduce application
>>> completely during ingestion, if possible. Also, I can then use Spark stand
>>> alone cluster to ingest, even if my hadoop cluster is heavily loaded. What
>>> you guys think?
>>>
>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> Why do you want to reimplement something which is already there?
>>>>
>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hi
>>>>
>>>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>>>> index and incremental only) and add data to existing Hive tables. Also, it
>>>> would be good to have an option to create Hive table, driven by job
>>>> specific configuration.
>>>>
>>>> What do you think?
>>>>
>>>> Best
>>>> Ayan
>>>>
>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It depends on your use case using sqoop.
>>>>> What's it like?
>>>>>
>>>>> // maropu
>>>>>
>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> Asking opinion: is it possible/advisable to use spark to replace what
>>>>>> sqoop does? Any existing project done in similar lines?
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Sqoop on Spark

2016-04-06 Thread Jorge Sánchez
Ayan,

there was a talk at Spark Summit:
https://spark-summit.org/2015/events/Sqoop-on-Spark-for-Data-Ingestion/
Apparently they had a lot of problems and the project seems abandoned.

If you just have to do simple ingestion of a full table or a simple query,
just use Sqoop as suggested by Mich, but if your use case requires further
transformation of the data, I'd suggest you try Spark connecting to Oracle
using JDBC and then having the data as a DataFrame.
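
Something along these lines - a sketch only, with hypothetical connection
details, bounds and column names; the partitioned jdbc() reader gives Spark a
degree of read parallelism roughly comparable to Sqoop's --split-by/--num-mappers:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "sh")
props.setProperty("password", "***")

// columnName / lowerBound / upperBound / numPartitions control how the table is
// split into parallel JDBC reads (and hence concurrent Oracle connections)
val df = sqlContext.read.jdbc(
  "jdbc:oracle:thin:@rhes564:1521:mydb",
  "sh.sales",
  "prod_id",   // numeric partition column
  1L,          // lower bound
  100000L,     // upper bound
  8,           // number of partitions
  props)

df.printSchema()   // from here it is an ordinary DataFrame for further transformation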

Regards.

2016-04-06 6:59 GMT+01:00 ayan guha <guha.a...@gmail.com>:

> Thanks guys for feedback.
>
> On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I do not think you can be more resource efficient. In the end you have to
>> store the data anyway on HDFS . You have a lot of development effort for
>> doing something like sqoop. Especially with error handling.
>> You may create a ticket with the Sqoop guys to support Spark as an
>> execution engine and maybe it is less effort to plug it in there.
>> Maybe if your cluster is loaded then you may want to add more machines or
>> improve the existing programs.
>>
>> On 06 Apr 2016, at 07:33, ayan guha <guha.a...@gmail.com> wrote:
>>
>> One of the reason in my mind is to avoid Map-Reduce application
>> completely during ingestion, if possible. Also, I can then use Spark stand
>> alone cluster to ingest, even if my hadoop cluster is heavily loaded. What
>> you guys think?
>>
>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Why do you want to reimplement something which is already there?
>>>
>>> On 06 Apr 2016, at 06:47, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hi
>>>
>>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>>> index and incremental only) and add data to existing Hive tables. Also, it
>>> would be good to have an option to create Hive table, driven by job
>>> specific configuration.
>>>
>>> What do you think?
>>>
>>> Best
>>> Ayan
>>>
>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It depends on your use case using sqoop.
>>>> What's it like?
>>>>
>>>> // maropu
>>>>
>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Hi All
>>>>>
>>>>> Asking opinion: is it possible/advisable to use spark to replace what
>>>>> sqoop does? Any existing project done in similar lines?
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Sqoop on Spark

2016-04-05 Thread ayan guha
Thanks guys for feedback.

On Wed, Apr 6, 2016 at 3:44 PM, Jörn Franke  wrote:

> I do not think you can be more resource efficient. In the end you have to
> store the data anyway on HDFS . You have a lot of development effort for
> doing something like sqoop. Especially with error handling.
> You may create a ticket with the Sqoop guys to support Spark as an
> execution engine and maybe it is less effort to plug it in there.
> Maybe if your cluster is loaded then you may want to add more machines or
> improve the existing programs.
>
> On 06 Apr 2016, at 07:33, ayan guha  wrote:
>
> One of the reason in my mind is to avoid Map-Reduce application completely
> during ingestion, if possible. Also, I can then use Spark stand alone
> cluster to ingest, even if my hadoop cluster is heavily loaded. What you
> guys think?
>
> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke  wrote:
>
>> Why do you want to reimplement something which is already there?
>>
>> On 06 Apr 2016, at 06:47, ayan guha  wrote:
>>
>> Hi
>>
>> Thanks for reply. My use case is query ~40 tables from Oracle (using
>> index and incremental only) and add data to existing Hive tables. Also, it
>> would be good to have an option to create Hive table, driven by job
>> specific configuration.
>>
>> What do you think?
>>
>> Best
>> Ayan
>>
>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> It depends on your use case using sqoop.
>>> What's it like?
>>>
>>> // maropu
>>>
>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha  wrote:
>>>
 Hi All

 Asking opinion: is it possible/advisable to use spark to replace what
 sqoop does? Any existing project done in similar lines?

 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>


-- 
Best Regards,
Ayan Guha


Sqoop on Spark

2016-04-05 Thread ayan guha
Hi All

Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop
does? Are there any existing projects done along similar lines?

-- 
Best Regards,
Ayan Guha