Re: optimize hive query to move a subset of data from one partition table to another table
Would you mind sharing your code with us to analyze?

> On Feb 10, 2018, at 10:18 AM, amit kumar singh wrote:
>
> Hi Team,
>
> We have a Hive external table which has 50 TB of data, partitioned on
> year/month/day.
>
> I want to move the last 2 months of data into another table.
>
> When I try to do this through Spark, more than 120k tasks are getting
> created.
>
> What is the best way to do this?
>
> Thanks,
> Rohit
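One way to avoid spawning a task per file across the whole 50 TB table is to push a partition-pruning predicate so only the affected partitions are scanned. A minimal sketch of building that predicate, assuming partition columns named `year` and `month` and hypothetical table names (`source_table`, `target_table`):

```python
from datetime import date

def last_two_month_partitions(today):
    """Return (year, month) for the previous month and the current month."""
    if today.month == 1:
        prev_year, prev_month = today.year - 1, 12
    else:
        prev_year, prev_month = today.year, today.month - 1
    return [(prev_year, prev_month), (today.year, today.month)]

def partition_predicate(today):
    """Build a WHERE clause that touches only the needed partitions, so the
    scan prunes to ~60 daily partitions instead of the whole table."""
    parts = last_two_month_partitions(today)
    return " OR ".join(f"(year = {y} AND month = {m})" for y, m in parts)

pred = partition_predicate(date(2018, 2, 10))
print(pred)  # (year = 2018 AND month = 1) OR (year = 2018 AND month = 2)

# Hypothetical table names; with dynamic partitioning enabled this moves the
# data in one INSERT rather than a huge number of tiny tasks:
insert_sql = (
    "INSERT OVERWRITE TABLE target_table PARTITION (year, month, day) "
    f"SELECT * FROM source_table WHERE {pred}"
)
print(insert_sql)
```

If the partition values are stored as zero-padded strings (e.g. `month='02'`), quote and pad them accordingly; repartitioning before the write also keeps the output file count sane.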
Re: Apache Spark - Structured Streaming Query Status - field descriptions
I can't find a good source of documentation, but the source code of
org.apache.spark.sql.execution.streaming.ProgressReporter is helpful for
answering some of them. For example:

  inputRowsPerSecond = numRecords / inputTimeSec
  processedRowsPerSecond = numRecords / processingTimeSec

This explains why the two rows-per-second figures differ.

> On Feb 10, 2018, at 8:42 PM, M Singh wrote:
>
> Hi:
>
> I am working with Spark 2.2.0 and am looking at the query status console
> output.
>
> My application reads from Kafka, performs flatMapGroupsWithState, and then
> aggregates the elements for two group counts. The output is sent to the
> console sink. I see the following output (with my questions inline).
>
> Please let me know where I can find a detailed description of the query
> status fields for Spark Structured Streaming.
>
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,
>       // Why is processedRowsPerSecond greater than inputRowsPerSecond?
>       // Does this include shuffling/grouping?
>   "durationMs" : {
>     "addBatch" : 9765,         // Is this the time taken to send output to all console output streams?
>     "getBatch" : 3,            // Is this the time taken to get the batch from Kafka?
>     "getOffset" : 3,           // Is this the time for getting offsets from Kafka?
>     "queryPlanning" : 89,      // This value changes across triggers, but the query is not changing, so why does it change?
>     "triggerExecution" : 9898, // Is this the total time for this trigger?
>     "walCommit" : 35           // Is this for checkpointing?
>   },
>   "stateOperators" : [ {
>       // What are the two state operators? I am assuming the first one is
>       // flatMapGroupsWithState.
>     "numRowsTotal" : 8,
>     "numRowsUpdated" : 1
>   }, {
>     "numRowsTotal" : 6,
>       // Is this the group-by state operator? If so, I have two group-bys,
>       // so why do I see only one?
>     "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[xyz]]",
>     "startOffset" : {
>       "xyz" : { "2" : 9183, "1" : 9184, "3" : 9184, "0" : 9183 }
>     },
>     "endOffset" : {
>       "xyz" : { "2" : 10628, "1" : 10629, "3" : 10629, "0" : 10628 }
>     },
>     "numInputRows" : 5780,
>     "inputRowsPerSecond" : 96.32851690748795,
>     "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
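Those two formulas also account for the specific numbers in the JSON above: processedRowsPerSecond divides by the time the batch took to process (triggerExecution), while inputRowsPerSecond divides by the time since the previous trigger, so a query that keeps up will always show a higher processed rate. A quick check in plain arithmetic (the ~60 s trigger interval is my inference from the reported input rate, not something stated in the thread):

```python
# Values taken from the status JSON above.
num_input_rows = 5780
trigger_execution_ms = 9898   # "triggerExecution": time spent processing the batch
trigger_interval_ms = 60_003  # time since the previous trigger (inferred, ~1 minute)

input_rows_per_second = num_input_rows / (trigger_interval_ms / 1000.0)
processed_rows_per_second = num_input_rows / (trigger_execution_ms / 1000.0)

print(round(input_rows_per_second, 2))      # 96.33, matches "inputRowsPerSecond"
print(round(processed_rows_per_second, 2))  # 583.96, matches "processedRowsPerSecond"
```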
Re: Run jobs in parallel in standalone mode
Two points to consider:

1. Check the SQL Server / Simba maximum connection number.
2. Allocate 3-5 cores for each executor and allocate more executors.

Sent from my iPhone

> On Jan 16, 2018, at 04:01, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
>
> Sorry, it is not Oracle. It is MSSQL.
>
> Do you have any opinion on the solution? I would really appreciate it.
>
> Onur EKİNCİ
> Knowledge Management Manager (Bilgi Yönetimi Yöneticisi)
> m: +90 553 044 2341  d: +90 212 329 7000
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası, 34469 Maslak, İstanbul
>
> From: Richard Qiao [mailto:richardqiao2...@gmail.com]
> Sent: Tuesday, January 16, 2018 11:59 AM
> To: Onur EKİNCİ <oeki...@innova.com.tr>
> Cc: user@spark.apache.org
> Subject: Re: Run jobs in parallel in standalone mode
>
> Out of curiosity, why are you using "jdbc:sqlserver" to connect to Oracle?
> Also, a kind reminder to scrub your user id and password.
>
> Sent from my iPhone
>
> On Jan 16, 2018, at 03:00, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
>
> Hi,
>
> We are trying to get data from an Oracle database into a Kinetica database
> through Apache Spark.
>
> We installed Spark in standalone mode and executed the following commands.
> We have tried everything, but we couldn't manage to run jobs in parallel.
> We use 2 IBM servers, each of which has 128 cores and 1 TB of memory.
>
> We also added these settings in spark-defaults.conf:
>
> spark.executor.memory=64g
> spark.executor.cores=32
> spark.default.parallelism=32
> spark.cores.max=64
> spark.scheduler.mode=FAIR
> spark.sql.shuffle.partitions=32
>
> On the machine 10.20.10.228:
>
> ./start-master.sh --webui-port 8585
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
> On the machine 10.20.10.229:
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
> On the machine 10.20.10.228 we start the Spark shell:
>
> spark-shell --master spark://10.20.10.228:7077
>
> Then we make the configurations:
>
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
>   .option("dbtable", "dbo.temp_muh_hareket")
>   .option("user", "gpudb")
>   .option("password", "Kinetica2017!")
>   .load()
> import com.kinetica.spark._
> val lp = new LoaderParams("http://10.20.10.228:9191",
>   "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
>   false, "", 10, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
> SparkKineticaLoader.KineticaWriter(df, lp);
>
> The above commands work successfully and the data transfer completes.
> However, the jobs run serially, not in parallel, and the executors also
> work serially and take turns.
>
> How can we make the jobs run in parallel?
>
> I really appreciate your help. We have done everything we could.
>
> Onur EKİNCİ
> Knowledge Management Manager (Bilgi Yönetimi Yöneticisi)
> m: +90 553 044 2341  d: +90 212 329 7000
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası, 34469 Maslak, İstanbul
>
> Legal Notice: This e-mail is subject to the Terms and Conditions document
> available at the following link:
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp
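One likely culprit in the snippet above: the JDBC read sets no partitioning options, so Spark fetches dbo.temp_muh_hareket over a single connection in a single task, and nothing downstream can parallelize until that finishes. Setting partitionColumn, lowerBound, upperBound, and numPartitions on the reader makes Spark split the query into concurrent range scans. A rough sketch of the splitting logic, simplified from Spark's JDBC source (the real implementation also handles NULLs and clamps the partition count):

```python
def column_partition(lower, upper, num_partitions, col):
    """Approximate how Spark splits a JDBC read when partitionColumn,
    lowerBound, upperBound, and numPartitions are set: each partition gets
    its own WHERE clause, hence its own connection and its own task."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        lo = f"{col} >= {current}" if i > 0 else None   # first range is open below
        current += stride
        hi = f"{col} < {current}" if i < num_partitions - 1 else None  # last is open above
        if lo and hi:
            predicates.append(f"{lo} AND {hi}")
        else:
            predicates.append(lo or hi)
    return predicates

print(column_partition(0, 100, 4, "id"))
# 4 predicates -> 4 concurrent JDBC queries instead of one serial scan
```

In the spark-shell this would mean something like `.option("partitionColumn", "id").option("lowerBound", "0").option("upperBound", "100000000").option("numPartitions", "32")` on a numeric, ideally indexed, column ("id" and the bounds here are placeholders for this table). The SQL Server side and the Kinetica/Simba side must both accept that many concurrent connections, which ties back to point 1 above.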