Re: custom RDD in java

2015-07-01 Thread Feynman Liang
AFAIK RDDs can only be created on the driver, not on the executors. Also,
`saveAsTextFile(...)` is an action and hence can likewise only be invoked
from the driver.

As Silvio already mentioned, Sqoop may be a good option.



Re: custom RDD in java

2015-07-01 Thread Shushant Arora
The list of tables is not large. The RDD is created over the table list to
parallelize the work of fetching tables in multiple mappers at the same time.
Since the time taken to fetch a table is significant, I can't run the fetches
sequentially.


The content of a table fetched by a map job is large, so one option is to dump
the content to HDFS using the filesystem API from inside the map function, for
every few rows of the table fetched.

I cannot keep a complete table in memory and then dump it to HDFS, as the map
function below would require:

JavaRDD<Iterable<String>> tablecontent = tablelistrdd.map(
    new Function<String, Iterable<String>>() {
      public Iterable<String> call(String tablename) {
        // make a JDBC connection, fetch the table data,
        // populate a list and return it
      }
    });
tablecontent.saveAsTextFile("hdfspath");

Here I wanted to create a custom RDD whose partitions would be in memory on
multiple executors, each holding part of the table data. I would then have
called saveAsTextFile on the custom RDD directly to save it to HDFS.
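[Editor's note: if the goal is only to avoid holding a table in memory, a custom RDD isn't strictly necessary; the map function itself can stream rows to HDFS and return just the output path. A rough sketch of that pattern against the Spark 1.x Java API, where the JDBC URL, output layout, and row formatting are placeholders, not details from this thread:]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// One task per table: stream rows from JDBC straight into an HDFS file,
// a fetch-batch at a time, so no table is ever fully held in memory.
final String jdbcUrl = "jdbc:sqlserver://host;databaseName=db"; // placeholder
JavaRDD<String> written = tablelistrdd.map(new Function<String, String>() {
  public String call(String tablename) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path out = new Path("/dumps/" + tablename);   // hypothetical layout
    Connection conn = DriverManager.getConnection(jdbcUrl);
    try {
      Statement st = conn.createStatement();
      st.setFetchSize(1000);                      // stream in small batches
      ResultSet rs = st.executeQuery("SELECT * FROM " + tablename);
      FSDataOutputStream os = fs.create(out, true);
      try {
        while (rs.next()) {
          os.writeBytes(rs.getString(1) + "\n");  // format rows as needed
        }
      } finally {
        os.close();
      }
    } finally {
      conn.close();
    }
    return out.toString();                        // only the path goes back
  }
});
written.collect(); // action invoked on the driver; triggers the writes
```

The returned RDD holds only output paths, so collecting it on the driver is cheap.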





Re: custom RDD in java

2015-07-01 Thread Feynman Liang
On Wed, Jul 1, 2015 at 7:19 AM, Shushant Arora wrote:

> JavaRDD<String> rdd = javasparkcontext.parallelize(tables);


You are already creating an RDD in Java here ;)

However, it's not clear to me why you'd want to make this an RDD. Is the
list of tables so large that it doesn't fit on a single machine? If not,
you may be better off spinning up one Spark job per table in tables, using
the JDBC datasource
<https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>.
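[Editor's note: a sketch of that JDBC-datasource route against the Spark 1.4-era Java API, with a plain driver-side loop over the table list; the connection URL, credentials, and output path are placeholders:]

```java
import java.util.Properties;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

// One read/write per table. Spark partitions each JDBC scan and streams
// the rows through the cluster; nothing is collected on the driver.
SQLContext sqlContext = new SQLContext(javasparkcontext);
Properties props = new Properties();
props.setProperty("user", "dbuser");       // placeholder credentials
props.setProperty("password", "dbpass");

for (String table : tables) {
  DataFrame df = sqlContext.read()
      .jdbc("jdbc:sqlserver://host;databaseName=db", table, props);
  df.javaRDD().map(new Function<Row, String>() {
    public String call(Row row) {
      return row.mkString("\t");           // tab-separated line per row
    }
  }).saveAsTextFile("/dumps/" + table);
}
```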



Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
Sure, you can create custom RDDs. Haven’t done so in Java, but in Scala 
absolutely.
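[Editor's note: subclassing RDD from Java is possible but clumsy because of the Scala-interop types involved. An untested sketch of the shape it takes, where the class name, table array, and partition contents are all invented for illustration:]

```java
import scala.collection.Iterator;
import scala.collection.JavaConverters;
import scala.reflect.ClassTag$;
import org.apache.spark.Dependency;
import org.apache.spark.Partition;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.rdd.RDD;

// Minimal custom RDD written in Java: one partition per table name.
public class TableNamesRDD extends RDD<String> {
  private final String[] tables;

  public TableNamesRDD(SparkContext sc, String[] tables) {
    // No parent dependencies; the ClassTag is required by the Scala API.
    super(sc, new scala.collection.mutable.ArrayBuffer<Dependency<?>>(),
          ClassTag$.MODULE$.apply(String.class));
    this.tables = tables;
  }

  @Override
  public Partition[] getPartitions() {
    Partition[] parts = new Partition[tables.length];
    for (int i = 0; i < tables.length; i++) {
      final int idx = i;
      parts[i] = new Partition() {
        public int index() { return idx; }
      };
    }
    return parts;
  }

  @Override
  public Iterator<String> compute(Partition split, TaskContext ctx) {
    // A real implementation would open JDBC here and stream the rows of
    // tables[split.index()] as the iterator is consumed.
    java.util.List<String> rows =
        java.util.Collections.singletonList(tables[split.index()]);
    return JavaConverters.asScalaIteratorConverter(rows.iterator()).asScala();
  }
}
```

compute() is where a JDBC connection would be opened so that no partition's data ever has to be fully materialized, which is what the original question is after.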

From: Shushant Arora
Date: Wednesday, July 1, 2015 at 1:44 PM
To: Silvio Fiorito
Cc: user
Subject: Re: custom RDD in java

OK, will evaluate these options, but is it possible to create an RDD in Java?



Re: custom RDD in java

2015-07-01 Thread Shushant Arora
OK, will evaluate these options, but is it possible to create an RDD in Java?




Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
If all you’re doing is just dumping tables from SQLServer to HDFS, have you 
looked at Sqoop?

Otherwise, if you need to run this in Spark could you just use the existing 
JdbcRDD?
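[Editor's note: JdbcRDD has a Java-friendly entry point, JdbcRDD.create. A sketch of using it for one table; the URL, table, and partitioning column are hypothetical, and the SQL must contain two '?' placeholders for the partition bounds:]

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.rdd.JdbcRDD;

// Parallel scan of one table, split into 10 partitions on the id column.
JavaRDD<String> rows = JdbcRDD.create(
    javasparkcontext,
    new JdbcRDD.ConnectionFactory() {
      public Connection getConnection() throws SQLException {
        return DriverManager.getConnection(
            "jdbc:sqlserver://host;databaseName=db");  // placeholder URL
      }
    },
    "SELECT * FROM sometable WHERE id >= ? AND id <= ?",
    1L, 1000000L,   // lower/upper bound of the partitioning column
    10,             // number of partitions
    new Function<ResultSet, String>() {
      public String call(ResultSet rs) throws SQLException {
        return rs.getString(1);  // format the row however you need
      }
    });
rows.saveAsTextFile("/dumps/sometable");
```

One JdbcRDD per table, driven from a loop on the driver, would cover the whole table list.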


From: Shushant Arora
Date: Wednesday, July 1, 2015 at 10:19 AM
To: user
Subject: custom RDD in java

Hi

Is it possible to write custom RDD in java?

The requirement: I have a list of SQL Server tables that need to be dumped
into HDFS.

So I have a
List<String> tables = {dbname.tablename, dbname.tablename2, ..};

then
JavaRDD<String> rdd = javasparkcontext.parallelize(tables);

JavaRDD<Iterable<String>> tablecontent = rdd.map(
    new Function<String, Iterable<String>>() {
      /* fetch the table and return a populated iterable */
    });

tablecontent.saveAsTextFile("hdfs path");


Inside the rdd.map(new Function(...)) call, I cannot keep the complete table
content in memory, so I want to create my own RDD to handle it.

Thanks
Shushant