Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Simon Kitching
Have you tried simply making a list of your table names and then using 
SparkContext.makeRDD(Seq)? i.e.:

val tablenames = List("table1", "table2", "table3")  // ...and so on
val tablesRDD = sc.makeRDD(tablenames, nParallelTasks)
tablesRDD.foreach { tablename => /* convert one table here */ }
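
If everything has to stay inside one Spark job, another option (not the same as the makeRDD approach above) is to submit the per-table conversions concurrently from the driver, e.g. with Scala Futures, and let Spark schedule the resulting jobs in parallel. A rough sketch, assuming Hive support is enabled; the table names and output path below are placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-hive-to-parquet")
  .enableHiveSupport()
  .getOrCreate()

// Placeholder table list and output root; adjust to your environment.
val tableNames = Seq("db.table1", "db.table2", "db.table3")
val outputRoot = "/archive/parquet"

// Each Future submits its own write job from the driver, so the jobs
// run concurrently, limited by the executors available to this application.
val conversions = tableNames.map { table =>
  Future {
    spark.table(table)
      .write
      .mode("overwrite")
      .parquet(s"$outputRoot/${table.replace('.', '/')}")
  }
}

// Block until every conversion has finished.
Await.result(Future.sequence(conversions), Duration.Inf)

The Futures only coordinate job submission on the driver; the reads and writes themselves still run distributed on the executors.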

> On 17.07.2017 at 14:12, FN wrote:
> 
> Hi
> I am currently trying to parallelize reading multiple tables from Hive. As
> part of an archival framework, I need to convert a few hundred tables which
> are in txt format to Parquet. For now I am calling Spark SQL inside a for
> loop for the conversion, but this executes sequentially and the entire
> process takes a long time to finish.
> 
> I tried submitting 4 different Spark jobs (each with its own set of tables to
> be converted), which did give me some parallelism, but I would like to do this
> in a single Spark job due to a few limitations of our cluster and process.
> 
> Any help will be greatly appreciated.





Re: Glue-like Functionality

2017-07-10 Thread Simon Kitching
Sounds similar to the Confluent Schema Registry and Kafka Connect.

The Schema Registry and Kafka Connect themselves are open-source, but some of 
the datasource-specific adapters, and the GUIs to manage it all, are not 
(see the Confluent Enterprise Edition).

Note that the Schema Registry and Kafka Connect are generic tools, not 
Spark-specific.

Regards, Simon

> On 08.07.2017 at 19:49, Benjamin Kim wrote:
> 
> Has anyone seen AWS Glue? I was wondering if something similar is going to be 
> built into Spark Structured Streaming. I like the Data Catalog idea for storing 
> and tracking any data source/destination. It profiles the data to derive the 
> schema and data types. Also, it does some sort of automated schema evolution 
> when or if the schema changes. It leaves only the transformation logic to the 
> ETL developer. I think some of this could enhance or simplify Structured 
> Streaming. For example, AWS S3 could be catalogued as a Data Source; in 
> Structured Streaming, the Input DataFrame would be created like a SQL view 
> based on the S3 Data Source; lastly, the Transform logic, if any, just 
> manipulates the data going from the Input DataFrame to the Result DataFrame, 
> which is another view based on a catalogued Data Destination. This would 
> relieve the ETL developer of having to care about any Data Source or 
> Destination. All server information, access credentials, data schemas, folder 
> directory structures, file formats, and any other properties could be securely 
> stored away, accessible to only a select few.
> 
> I'm just curious to know if anyone has thought the same thing.
> 
> Cheers,
> Ben
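
The source-view to transform to sink flow described above maps fairly directly onto the existing Structured Streaming API; what Glue adds is the catalog, profiling and schema-evolution layer around it. A minimal sketch of just the streaming part, with a hand-declared schema and placeholder S3 paths standing in for catalogued entries:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("glue-like-sketch").getOrCreate()

// Hypothetical schema: Glue would derive this by profiling; here it is declared by hand.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)
))

// "Data Source": a catalogued S3 location becomes the input DataFrame (a streaming view).
val input = spark.readStream
  .schema(schema)
  .json("s3a://my-bucket/incoming/")            // placeholder path

// "Transform": the only part the ETL developer would still write.
val result = input.filter(col("payload").isNotNull)

// "Data Destination": the result DataFrame is written out to a catalogued sink.
val query = result.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")  // placeholder
  .option("path", "s3a://my-bucket/curated/")                    // placeholder
  .start()

query.awaitTermination()

The catalog piece (keeping paths, credentials, and schemas out of the job code, plus schema evolution) is what Spark itself does not provide; that is where Glue, or the Schema Registry and Kafka Connect combination mentioned above, comes in.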

