Re: Reading Hive tables Parallel in Spark

2017-07-18 Thread Matteo Cossu
The context you use for calling Spark SQL can only be used in the driver.
Moreover, collect() works because it pulls the RDD into the driver's local
memory, but it should mostly be reserved for debugging; if all your data fits
into a single machine's memory, you probably shouldn't be using Spark at all
but a regular database instead.
For your problem: if you still want to use Spark SQL, use threads; if instead
you want to use parallelize() or foreach, you have to avoid calling anything
that needs to stay in the driver.
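
For illustration, a minimal sketch of the thread-based variant, assuming a
driver-side SparkSession named spark and made-up table names and output path:

import java.util.concurrent.{Callable, Executors}
import scala.collection.JavaConverters._

val tables = Seq("table1", "table2", "table3")   // placeholder table names
val pool = Executors.newFixedThreadPool(4)       // at most 4 conversions in flight

val jobs = tables.map { t =>
  new Callable[Unit] {
    // The SparkSession is only touched here, in the driver; the executors
    // just run the jobs produced by the read and write.
    override def call(): Unit =
      spark.table(t).write.mode("overwrite").parquet(s"/archive/$t")
  }
}

pool.invokeAll(jobs.asJava)   // blocks until every conversion has finished or failed
pool.shutdown()

Each call() submits an ordinary Spark job, so the cluster still does the heavy
lifting; the threads only exist to keep several jobs in flight at once.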



Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I was getting a NullPointerException when trying to call Spark SQL from
foreach. After debugging, I found that the SparkSession is not available on
the executors, and I could not pass it to them successfully.

What I am doing now is tablesRDD.collect().foreach(...), which works but runs
sequentially.
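
For clarity, the two variants look roughly like this (tablesRDD holding table
names and the session variable spark are assumptions):

// Fails with NullPointerException: the closure runs on the executors,
// where the driver's SparkSession is not available.
tablesRDD.foreach { t => spark.sql(s"select * from $t") }

// Works, but the loop now runs in the driver, one table at a time.
tablesRDD.collect().foreach { t => spark.sql(s"select * from $t") }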



Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I did try threading, but got many failed tasks and they were not reprocessed.
I am guessing the driver lost track of the threaded tasks. I also tried Scala's
Future and .par, with the same result as above.
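
One sketch of how the failures could at least be surfaced in the driver and
retried later, assuming a list of table names and a driver-side SparkSession
named spark:

import scala.util.{Failure, Try}

val results = tableNames.par.map { t =>
  t -> Try(spark.table(t).write.mode("overwrite").parquet(s"/archive/$t"))
}.toList

// Tables whose conversion threw an exception; retry them or log them for a later run.
val failed = results.collect { case (t, Failure(_)) => t }
println(s"Failed conversions: ${failed.mkString(", ")}")

This does not change how Spark schedules the jobs, but it keeps the outcome of
every conversion in the driver instead of losing track of it inside a thread.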



Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Rick Moritz
Put your jobs into a parallel collection using .par -- then you can submit
them to Spark very easily using .foreach. The jobs will then run under Spark's
FIFO scheduler.

The advantage over the prior approaches is that you don't have to deal with
threads yourself, and that you can leave the parallelism completely to Spark.
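
A minimal sketch of that approach, assuming a SparkSession named spark with
Hive support and a made-up output path:

val tableNames = List("table1", "table2", "table3")

tableNames.par.foreach { t =>
  // Each iteration submits its own Spark job; the driver-side parallel
  // collection just keeps several of them in flight at the same time.
  spark.table(t)
    .write
    .mode("overwrite")
    .parquet(s"/archive/parquet/$t")
}

By default the parallel collection uses a pool sized to the number of driver
cores, which caps how many conversions are submitted at once.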



Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Simon Kitching
Have you tried simply making a list with your tables in it, then using
SparkContext.makeRDD(Seq)? i.e.

val tablenames = List("table1", "table2", "table3", ...)
val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
tablesRDD.foreach { tableName => /* process tableName here */ }






Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Pralabh Kumar
Run the Spark context in a multithreaded way.

Something like this:

import org.apache.spark.sql.SparkSession

// One SparkSession with Hive support; the FAIR scheduler lets jobs submitted
// from different threads share the cluster instead of queueing strictly FIFO.
val spark = SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode", "FAIR")
  .enableHiveSupport()
  .getOrCreate()
val hc = spark.sqlContext

val thread1 = new Thread {
  override def run(): Unit = {
    hc.sql("select * from table1")
  }
}

val thread2 = new Thread {
  override def run(): Unit = {
    hc.sql("select * from table2")
  }
}

thread1.start()
thread2.start()

// Wait for both queries to finish.
thread1.join()
thread2.join()





Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Matteo Cossu
Hello,
have you tried to use threads instead of the loop?



Reading Hive tables Parallel in Spark

2017-07-17 Thread FN
Hi,
I am currently trying to parallelize reading multiple tables from Hive. As
part of an archival framework, I need to convert a few hundred tables that are
in text format to Parquet. For now I am calling Spark SQL inside a for loop
for the conversion, but this executes sequentially and the entire process
takes a long time to finish.
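
The loop in question looks roughly like this (table names, the spark session
variable, and the output path are just placeholders):

for (t <- tableNames) {
  // Each iteration is its own Spark job; the next table only starts
  // after the previous one has been fully written.
  spark.sql(s"select * from $t")
    .write
    .mode("overwrite")
    .parquet(s"/archive/parquet/$t")
}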

I tried submitting 4 different Spark jobs (each with its own set of tables to
be converted), which did give me some parallelism, but I would like to do this
in a single Spark job due to a few limitations of our cluster and process.

Any help will be greatly appreciated.




