Re: Reading Hive tables Parallel in Spark
The context you use to call Spark SQL can only be used in the driver. Moreover, collect() works because it pulls the RDD into local memory, but it should be used for debugging purposes only (95% of the time); if all your data fits into a single machine's memory, you shouldn't be using Spark at all but a regular database. For your problem: if you still want to use Spark SQL, use threads; if instead you want to use parallelize() or foreach, you must avoid calling anything that has to stay in the driver, such as the SparkSession itself.

On 17 July 2017 at 17:46, Fretz Nuson <nuson.fr...@gmail.com> wrote:
> I was getting a NullPointerException when trying to call Spark SQL from
> foreach. [...]
> What I am doing is tablesRDD.foreach.collect(); it works, but it runs
> sequentially.
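The "use threads" advice above can be sketched as follows, using Futures on a bounded driver-side pool. This is a minimal stand-in so it runs without a cluster: the convert helper and the pool size of 4 are assumptions for illustration; in the real job, convert's body would call spark.sql(...) from the driver.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object DriverSideParallelism {
  // Hypothetical per-table step; in the real job this body would run
  // spark.sql(s"... $table ...") -- still on the driver, just on
  // several driver threads at once, so Spark can interleave the jobs.
  def convert(table: String): String = s"$table -> parquet"

  def main(args: Array[String]): Unit = {
    val tables = List("table1", "table2", "table3")
    val pool = Executors.newFixedThreadPool(4) // bounds concurrent jobs
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(pool)
    // One Future per table; Future.sequence waits for all of them.
    val done = Future.sequence(tables.map(t => Future(convert(t))))
    Await.result(done, 10.minutes).foreach(println)
    pool.shutdown() // let the driver JVM exit cleanly
  }
}
```

Bounding the pool matters here: each Future still occupies a driver thread while its Spark job runs, so an unbounded pool would submit every table's job at once.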
Re: Reading Hive tables Parallel in Spark
I was getting a NullPointerException when trying to call Spark SQL from foreach. After debugging, I found that the Spark session is not available in the executors, and I could not successfully pass it to them.

What I am doing is tablesRDD.foreach.collect(); it works, but it runs sequentially.

On Mon, Jul 17, 2017 at 5:58 PM, Simon Kitching <simon.kitch...@unbelievable-machine.com> wrote:
> Have you tried simply making a list with your tables in it, then using
> SparkContext.makeRDD(Seq)? i.e.
>
> val tablenames = List("table1", "table2", "table3", ...)
> val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
> tablesRDD.foreach(t => processTable(t))
> [...]
Re: Reading Hive tables Parallel in Spark
I did try threading, but I got many failed tasks and they were not reprocessed. I am guessing the driver lost track of the threaded tasks. I had also tried Scala's Future and .par, with the same result as above.

On Mon, Jul 17, 2017 at 5:56 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
> Run the Spark context in a multithreaded way. Something like this:
> [...]
Re: Reading Hive tables Parallel in Spark
Put your jobs into a parallel collection using .par -- then you can submit them very easily to Spark using .foreach. The jobs will then run under Spark's FIFO scheduler. The advantage over the prior approaches is that you won't have to deal with threads yourself, and you can leave the parallelism completely to Spark.

On Mon, Jul 17, 2017 at 2:28 PM, Simon Kitching <simon.kitch...@unbelievable-machine.com> wrote:
> Have you tried simply making a list with your tables in it, then using
> SparkContext.makeRDD(Seq)?
> [...]
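The .par approach can be sketched as below. This is a pure-Scala stand-in so it runs without a cluster, and it assumes Scala 2.12, where parallel collections ship with the standard library (on 2.13+ you would add the scala-parallel-collections module); runJob is a hypothetical placeholder for the per-table spark.sql call, not a real Spark API.

```scala
object ParSubmit {
  // Hypothetical stand-in for one Spark job, e.g.
  // spark.sql(s"CREATE TABLE ${t}_pq STORED AS PARQUET AS SELECT * FROM $t")
  def runJob(t: String): String = s"job($t)"

  def main(args: Array[String]): Unit = {
    val tables = List("table1", "table2", "table3")
    // .par turns the list into a parallel collection; foreach then
    // submits the jobs from several threads concurrently, and Spark's
    // scheduler (FIFO by default) runs them. Completion order is not
    // deterministic, which is why nothing here depends on it.
    tables.par.foreach(t => println(runJob(t)))
  }
}
```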
Re: Reading Hive tables Parallel in Spark
Have you tried simply making a list with your tables in it, then using SparkContext.makeRDD(Seq)? i.e.

val tablenames = List("table1", "table2", "table3", ...)
val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
tablesRDD.foreach(t => processTable(t))  // processTable: your per-table work; note it runs on the executors

> Am 17.07.2017 um 14:12 schrieb FN <nuson.fr...@gmail.com>:
> Hi
> I am currently trying to parallelize reading multiple tables from Hive.
> [...]
Re: Reading Hive tables Parallel in Spark
Run the Spark context in a multithreaded way. Something like this:

val spark = SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode", "FAIR")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext
val hc = spark.sqlContext

val thread1 = new Thread {
  override def run(): Unit = {
    hc.sql("select * from table1")
  }
}

val thread2 = new Thread {
  override def run(): Unit = {
    hc.sql("select * from table2")
  }
}

thread1.start()
thread2.start()
// wait for both queries to finish before the driver exits
thread1.join()
thread2.join()

On Mon, Jul 17, 2017 at 5:42 PM, FN <nuson.fr...@gmail.com> wrote:
> Hi
> I am currently trying to parallelize reading multiple tables from Hive.
> [...]
Re: Reading Hive tables Parallel in Spark
Hello, have you tried using threads instead of the loop?

On 17 July 2017 at 14:12, FN <nuson.fr...@gmail.com> wrote:
> Hi
> I am currently trying to parallelize reading multiple tables from Hive.
> [...]
Reading Hive tables Parallel in Spark
Hi,
I am currently trying to parallelize reading multiple tables from Hive. As part of an archival framework, I need to convert a few hundred tables from text format to Parquet. For now I am calling Spark SQL inside a for loop for the conversion, but this executes sequentially and the entire process takes a long time to finish.

I tried submitting 4 different Spark jobs (each with its own set of tables to be converted), which did give me some parallelism, but I would like to do this in a single Spark job due to a few limitations of our cluster and process.

Any help will be greatly appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Hive-tables-Parallel-in-Spark-tp28869.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
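For concreteness, the sequential for-loop described in this original post might look like the following sketch. The table names and the CREATE TABLE ... AS SELECT statement are assumptions for illustration, not the poster's actual framework code, and the spark.sql call is left as a comment so the sketch runs standalone:

```scala
object SequentialBaseline {
  // Hypothetical helper building the conversion statement for one table;
  // in the real job the driver would pass this string to spark.sql(...).
  def ctas(table: String): String =
    s"CREATE TABLE ${table}_parquet STORED AS PARQUET AS SELECT * FROM $table"

  def main(args: Array[String]): Unit = {
    val tables = List("table1", "table2", "table3")
    for (t <- tables) {
      val stmt = ctas(t)
      // spark.sql(stmt)  // blocks until done: the next table waits,
      //                  // which is the sequential behavior described above
      println(stmt)
    }
  }
}
```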