Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-11 Thread Richard Qiao
Would you mind sharing your code with us to analyze?
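
In the meantime, the usual pattern for this kind of copy is to prune the
partitions at read time and cap the task count before writing. A rough
sketch only, with table and column names assumed (adjust the predicate to
your actual last two months):

    // Placeholders: src_db.events / dst_db.events_recent and the year/month
    // partition columns stand in for your actual tables.
    val recent = spark.table("src_db.events")
      .where("(year = 2018 AND month = 1) OR (year = 2017 AND month = 12)")

    recent
      .repartition(200)   // cap write tasks instead of inheriting ~120k splits
      .write
      .mode("append")
      .insertInto("dst_db.events_recent")  // assumed to exist, matching schema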

> On Feb 10, 2018, at 10:18 AM, amit kumar singh  wrote:
> 
> Hi Team,
> 
> We have a Hive external table with 50 TB of data, partitioned by
> year/month/day.
> 
> I want to move the last 2 months of data into another table.
> 
> When I try to do this through Spark, more than 120k tasks are created.
> 
> What is the best way to do this?
> 
> thanks
> Rohit





Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread Richard Qiao
I can't find a good source of documentation, but the source code of
“org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful for
answering some of them.

For example:
  inputRowsPerSecond = numRecords / inputTimeSec
  processedRowsPerSecond = numRecords / processingTimeSec
This explains why the two rates differ: inputTimeSec is the time since the
previous trigger started, while processingTimeSec is only the time spent
processing this batch, so processedRowsPerSecond comes out higher whenever
the trigger interval exceeds the actual processing time.
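
If you want to watch these numbers from code rather than from the console
log, the same metrics are exposed on the StreamingQuery handle. A minimal
sketch, assuming "query" is your running StreamingQuery:

    val p = query.lastProgress    // StreamingQueryProgress, same JSON as below
    println(p.numInputRows)                 // rows ingested in the last trigger
    println(p.inputRowsPerSecond)           // numRecords / inputTimeSec
    println(p.processedRowsPerSecond)       // numRecords / processingTimeSec
    println(p.durationMs.get("triggerExecution"))  // total ms for the trigger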

> On Feb 10, 2018, at 8:42 PM, M Singh  wrote:
> 
> Hi:
> 
> I am working with Spark 2.2.0 and am looking at the query status console
> output.
> 
> My application reads from Kafka, performs flatMapGroupsWithState, and then
> aggregates the elements for two group counts. The output is sent to the
> console sink. I see the following output (with my questions inline).
> 
> Please let me know where I can find a detailed description of the query
> status fields for Spark Structured Streaming.
> 
> 
> StreamExecution: Streaming query made progress: {
>   "id" : "8eff62a9-81a8-4142-b332-3e5ec63e06a2",
>   "runId" : "21778fbb-406c-4c65-bdef-d9d2c24698ce",
>   "name" : null,
>   "timestamp" : "2018-02-11T01:18:00.005Z",
>   "numInputRows" : 5780,
>   "inputRowsPerSecond" : 96.32851690748795,
>   "processedRowsPerSecond" : 583.9563548191554,  // Why is processedRowsPerSecond greater than inputRowsPerSecond? Does this include shuffling/grouping?
>   "durationMs" : {
>     "addBatch" : 9765,         // Is this the time taken to send output to all console output streams?
>     "getBatch" : 3,            // Is this the time taken to get the batch from Kafka?
>     "getOffset" : 3,           // Is this the time for getting offsets from Kafka?
>     "queryPlanning" : 89,      // This value changes with different triggers, but the query is not changing, so why does it change?
>     "triggerExecution" : 9898, // Is this the total time for this trigger?
>     "walCommit" : 35           // Is this for checkpointing?
>   },
>   "stateOperators" : [ {       // What are the two state operators? I am assuming the first one is flatMapGroupsWithState.
>     "numRowsTotal" : 8,
>     "numRowsUpdated" : 1
>   }, {
>     "numRowsTotal" : 6,        // Is this the group-by state operator? If so, I have two group-bys, so why do I see only one?
>     "numRowsUpdated" : 6
>   } ],
>   "sources" : [ {
>     "description" : "KafkaSource[Subscribe[xyz]]",
>     "startOffset" : {
>       "xyz" : {
>         "2" : 9183,
>         "1" : 9184,
>         "3" : 9184,
>         "0" : 9183
>       }
>     },
>     "endOffset" : {
>       "xyz" : {
>         "2" : 10628,
>         "1" : 10629,
>         "3" : 10629,
>         "0" : 10628
>       }
>     },
>     "numInputRows" : 5780,
>     "inputRowsPerSecond" : 96.32851690748795,
>     "processedRowsPerSecond" : 583.9563548191554
>   } ],
>   "sink" : {
>     "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@15fc109c"
>   }
> }
> 
> 



Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Richard Qiao
Two points to consider:
1. Check the SQL Server / Simba maximum connection limit.
2. Allocate 3-5 cores per executor and run more executors.
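
On point 1: if the JDBC read runs as a single task, adding executors will
not parallelize it; the read needs explicit partitioning options, and each
partition opens its own connection, which is why the connection limit
matters. A sketch under assumed bounds ("id" and the min/max values are
placeholders; pick a roughly uniform numeric column from your table):

    val df = spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
      .option("dbtable", "dbo.temp_muh_hareket")
      .option("user", "gpudb")
      .option("password", "****")        // scrubbed, per the reminder below
      .option("partitionColumn", "id")   // assumed numeric, roughly uniform
      .option("lowerBound", "1")
      .option("upperBound", "10000000")  // assumed max value of "id"
      .option("numPartitions", "32")     // 32 parallel reads = 32 connections
      .load()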

Sent from my iPhone

> On Jan 16, 2018, at 04:01, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
> 
> Sorry, it is not Oracle. It is MSSQL.
>  
> Do you have any opinion on a solution? I would really appreciate it.
>  
>  
> 
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>  
> m:+90 553 044 2341  d:+90 212 329 7000
>  
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası, 34469 Maslak, İstanbul
> 
>  
> 
> From: Richard Qiao [mailto:richardqiao2...@gmail.com] 
> Sent: Tuesday, January 16, 2018 11:59 AM
> To: Onur EKİNCİ <oeki...@innova.com.tr>
> Cc: user@spark.apache.org
> Subject: Re: Run jobs in parallel in standalone mode
>  
> Curious that you are using "jdbc:sqlserver" to connect to Oracle. Why?
> Also, a kind reminder to scrub your user ID and password.
> 
> Sent from my iPhone
> 
> On Jan 16, 2018, at 03:00, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
> 
> Hi,
>  
> We are trying to get data from an Oracle database into a Kinetica database
> through Apache Spark.
>  
> We installed Spark in standalone mode and executed the commands below.
> However, we have tried everything and couldn't manage to run jobs in
> parallel. We use 2 IBM servers, each with 128 cores and 1 TB of memory.
>  
> We also added the following in spark-defaults.conf:
> spark.executor.memory=64g
> spark.executor.cores=32
> spark.default.parallelism=32
> spark.cores.max=64
> spark.scheduler.mode=FAIR
> spark.sql.shuffle.partitions=32
>  
>  
> On the machine: 10.20.10.228
> ./start-master.sh --webui-port 8585
>  
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine 10.20.10.229:
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine: 10.20.10.228:
>  
> We start the Spark shell:
>  
> spark-shell --master spark://10.20.10.228:7077
>  
> Then we run the following:
>  
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
>   .option("dbtable", "dbo.temp_muh_hareket")
>   .option("user", "gpudb")
>   .option("password", "Kinetica2017!")
>   .load()
> import com.kinetica.spark._
> val lp = new LoaderParams("http://10.20.10.228:9191",
>   "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
>   false, "", 10, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
> SparkKineticaLoader.KineticaWriter(df, lp);
>  
>  
> The above commands work and the data transfer completes. However, the jobs
> run serially, not in parallel, and the executors take turns instead of
> working concurrently.
>  
> How can we make jobs work in parallel?
>  
> I really appreciate your help. We have done everything that we could.
>  
> 
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>  
> m:+90 553 044 2341  d:+90 212 329 7000
>  
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası, 34469 Maslak, İstanbul
> 
>  
> 
> 
> Legal Notice:
> This e-mail is subject to the Terms and Conditions document available at:
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp

