WrappedArray to row of relational Db
I have a nested structure which I read from XML using spark-xml. I want to use Spark SQL to convert this nested structure into different relational tables. A sample row looks like:

    (WrappedArray([WrappedArray([[null,592006340,null],null,BA,M,1724]),N,2017-04-05T16:31:03,586257528),659925562)

and it has this schema:

    StructType(
      StructField(AirSegment, ArrayType(StructType(
        StructField(CodeshareDetails, ArrayType(StructType(
          StructField(Links, StructType(
            StructField(_VALUE, StringType, true),
            StructField(_mktSegmentID, LongType, true),
            StructField(_oprSegmentID, LongType, true)), true),
          StructField(_alphaSuffix, StringType, true),
          StructField(_carrierCode, StringType, true),
          StructField(_codeshareType, StringType, true),
          StructField(_flightNumber, StringType, true)), true), true),
        StructField(_adsIsDeleted, StringType, true),
        StructField(_adsLastUpdateTimestamp, StringType, true),
        StructField(_AirID, LongType, true)), true), true),
      StructField(flightId, LongType, true))

Question: as you can see, CodeshareDetails is a WrappedArray inside a WrappedArray. How can I extract the rows of this inner array along with the _AirID column, so that I can insert those rows into the codeshare table (SQLite DB), which has only the codeshare-related columns plus _AirID as a foreign key used for joining back?

PS: I tried exploding, but it does not work properly when the AirSegment array contains multiple rows.

My table structure is as follows:

Flight: containing flightId and other details
AirSegment: containing _AirID (PK), flightID (FK), and the air-segment details
CodeshareDetails: containing the codeshare details as well as _AirID (FK)

Let me know if you need any more information.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/WrappedArray-to-row-of-relational-Db-tp28625.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
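For reference, the shape of answer this usually takes is two explode steps: first one row per AirSegment element (which surfaces _AirID), then one row per CodeshareDetails element, carrying _AirID along. A sketch, assuming a DataFrame `df` with the schema above (untested; the intermediate column names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, explode}

// Step 1: one row per AirSegment element, keeping flightId alongside it.
val segments = df
  .select(col("flightId"), explode(col("AirSegment")).as("seg"))
  .select(
    col("flightId"),
    col("seg._AirID").as("_AirID"),
    col("seg._adsIsDeleted"),
    col("seg._adsLastUpdateTimestamp"),
    col("seg.CodeshareDetails").as("CodeshareDetails"))

// Step 2: one row per CodeshareDetails element, carrying _AirID along
// so it can serve as the foreign key in the codeshare table.
val codeshares = segments
  .select(col("_AirID"), explode(col("CodeshareDetails")).as("cs"))
  .select(
    col("_AirID"),
    col("cs._carrierCode"),
    col("cs._flightNumber"),
    col("cs._codeshareType"),
    col("cs._alphaSuffix"))
```

Because each explode multiplies rows before the next select, multiple AirSegment elements per flight are handled naturally; `codeshares` can then be written to the CodeshareDetails table (e.g. over JDBC).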
Running multiple Spark Jobs on Yarn (Client mode)
I have a silly question: do multiple Spark jobs running on YARN have any impact on each other? For example, if the traffic on one streaming job increases too much, does that have any effect on a second job? Will it slow the second job down, or are there other consequences? I have enough resources (memory, cores) for both jobs in the same cluster.

Thanks,
Vaibhav

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-multiple-Spark-Jobs-on-Yarn-Client-mode-tp27364.html
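Whether the jobs can interfere usually comes down to whether each one has a hard cap on its resources. A hedged sketch of submitting the two jobs with fixed allocations on separate YARN queues (the queue names, sizes, and jar names are made up for illustration):

```shell
# Job 1: streaming job with a fixed allocation on its own YARN queue.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue streaming \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  streaming-job.jar

# Job 2: second job on a separate queue, so a traffic spike in job 1
# cannot take executors away from it.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue batch \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  other-job.jar
```

With fixed allocations like this the jobs only contend for shared hardware (disk, network); with dynamic allocation or a shared queue, a spike in one job can delay container grants to the other.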
JoinWithCassandraTable over individual queries
Hi,

I have an RDD whose elements are tuples ((key1, key2), value), where (key1, key2) is the partitioning key in my Cassandra table. For each such element I have to do a read from the Cassandra table. My Cassandra cluster and my Spark cluster are on different nodes and cannot be co-located. Right now I am doing an individual query per element using session.execute("..."). Should I prefer joinWithCassandraTable over individual queries? Do I get some performance benefit? As I understand it, joinWithCassandraTable is ultimately going to perform a query per partitioning key (or primary key, I'm not sure) anyway.

Regards,
Vaibhav

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/JoinWithCassandraTable-over-individual-queries-tp26833.html
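For comparison, a sketch of the joinWithCassandraTable form of the same lookup using the spark-cassandra-connector (untested; the keyspace, table, and key types are placeholders):

```scala
import com.datastax.spark.connector._

// rdd: RDD[((String, String), Long)] with (key1, key2) as the partition key.
// Map each element down to the key columns the connector should look up.
val joined = rdd
  .map { case ((key1, key2), value) => (key1, key2) }
  .joinWithCassandraTable("my_keyspace", "my_table")

// joined pairs each key tuple with the matching CassandraRow. The connector
// issues the per-key reads asynchronously within each Spark partition and
// reuses sessions, instead of one blocking session.execute(...) per element.
```

So while it still reads per partition key, the batching, session reuse, and async execution inside each task are where the benefit over hand-rolled individual queries would come from.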
Relation between number of partitions and cores.
As per the Spark programming guide, "we should have 2-4 partitions for each CPU in your cluster." In this case, how does 1 CPU core process 2-4 partitions at the same time? Does it context-switch between tasks or run them in parallel? If it context-switches, how is that more efficient than a 1:1 ratio of partitions to cores?

PS: If we are using the Kafka direct API, in which Kafka partitions = RDD partitions, does that mean we should create 40 Kafka partitions for 10 CPU cores?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Relation-between-number-of-partitions-and-cores-tp26658.html
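For the Kafka direct case, the 2-4x guideline would be applied at topic-creation time, since each Kafka partition maps to exactly one RDD partition. A small sketch of the arithmetic (the 3x factor is just the midpoint of the suggested range, not a Spark constant):

```scala
// Kafka direct API: RDD partitions == Kafka topic partitions,
// so the parallelism is fixed when the topic is created.
val totalCores = 10          // executor cores available to the job
val partitionsPerCore = 3    // midpoint of the suggested 2-4 range
val suggestedKafkaPartitions = totalCores * partitionsPerCore  // 30

// Alternatively, a topic with fewer partitions can be fanned out
// after ingest with .repartition(suggestedKafkaPartitions),
// at the cost of a shuffle.
```

Note that a core still runs one task at a time; the extra partitions are queued tasks, which helps smooth out skew because finished cores can pick up remaining small tasks instead of idling.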
Is one batch created by Streaming Context always equal to one RDD?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-one-batch-created-by-Streaming-Context-always-equal-to-one-RDD-tp25117.html
log4j Spark-worker performance problem
Hello,

We need a lot of logging for our application: about 1,000 lines need to be logged per message we process, and we process 1,000 msgs/sec, so the total is about 1,000 * 1,000 = 1,000,000 lines/sec, all written to a file. Will writing this much log output significantly impact Spark's processing throughput? If yes, what can be the alternative?

Note: this much logging is required for appropriate monitoring of the application. Let me know if more information is needed.

Thanks,
Vaibhav

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/log4j-Spark-worker-performance-problem-tp24842.html
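At that volume the cost comes from the task threads doing synchronous file appends, so one thing commonly measured as an alternative is an asynchronous appender that decouples logging from processing. A hedged log4j 1.x sketch (AsyncAppender cannot be configured from log4j.properties, so this is a log4j.xml fragment; file paths and sizes are illustrative):

```xml
<!-- Wrap the file appender in an AsyncAppender so task threads enqueue
     log events into a buffer instead of blocking on disk I/O. -->
<appender name="file" class="org.apache.log4j.RollingFileAppender">
  <param name="File" value="/var/log/app/worker.log"/>
  <param name="MaxFileSize" value="256MB"/>
  <param name="MaxBackupIndex" value="10"/>
  <layout class="org.apache.log4j.PatternLayout">
    <param name="ConversionPattern" value="%d %p %c{1}: %m%n"/>
  </layout>
</appender>

<appender name="async" class="org.apache.log4j.AsyncAppender">
  <param name="BufferSize" value="8192"/>
  <!-- Blocking=false drops events under overload instead of stalling tasks;
       set it to true if every line must be kept at the cost of latency. -->
  <param name="Blocking" value="false"/>
  <appender-ref ref="file"/>
</appender>

<root>
  <level value="INFO"/>
  <appender-ref ref="async"/>
</root>
```

Whether this is enough at 1M lines/sec per node is something only a benchmark on your hardware can answer; the disk's sequential write throughput becomes the floor either way.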