Re: spark sql query optimization, and decision tree building
If you want to calculate the mean, variance, minimum, maximum, and total count for each column, especially for machine learning features, you can try MultivariateOnlineSummarizer. It implements a numerically stable algorithm to compute the sample mean and variance by column in an online fashion. It supports both sparse and dense vectors, which can be constructed from the feature columns. The time complexity is O(nnz) instead of O(n) for each column, where nnz is the number of nonzeros in that column.

2014-10-23 1:09 GMT+08:00, sanath kumar sanath1...@gmail.com:

> Thank you very much, two more small questions:
>
> 1) val output = sqlContext.sql("SELECT * FROM people") -- my output has 128 columns and a single row. How can I find which column has the maximum value in that single row using Scala?
>
> 2) As each row has 128 columns, how can I print each row to a text file with space delimitation, or as JSON, using Scala?
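A minimal sketch of that approach, assuming Spark 1.1-era MLlib, a live SparkContext, and the `output` SchemaRDD from the question; the hard-coded 128 and the use of `getDouble` are assumptions about the data:

```scala
// Sketch only: assumes all 128 columns are numeric Doubles.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Turn each 128-column Row into a dense MLlib vector.
val vectors = output.map(row =>
  Vectors.dense(Array.tabulate(128)(i => row.getDouble(i))))

// One pass over the data: fold rows into per-partition summarizers, then merge them.
val summary = vectors.aggregate(new MultivariateOnlineSummarizer)(
  (acc, v) => acc.add(v),  // add one vector to a partition's summarizer
  (a, b) => a.merge(b))    // combine partition summarizers

println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances
println(summary.max)       // per-column maxima
```

Because `add` and `merge` are both single-pass and associative, this scales to the 1M-row dataset without collecting anything to the driver.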
RE: spark sql query optimization, and decision tree building
The "output" variable is actually a SchemaRDD; it provides lots of DSL APIs, see http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD

1) How to save the result values of a query into a list?
[CH:] val list: Array[Row] = output.collect(); however, pulling 1M records into an array does not seem like a good idea.

2) How to calculate the variance of a column? Is there an efficient way?
[CH:] Not sure what you mean by that, but you can try output.select('colname).groupBy(...)?

3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
[CH:] val cachedRdd = output.cache(), then do whatever you need based on cachedRdd.

4) How to save the output as key-value pairs in a text file?
[CH:] cachedRdd.generate(xx,xx,xx).saveAsTextFile(xx)

5) Is there any way I can build a kd decision tree using the machine learning libraries of Spark?
[CH:] Sorry, I am not sure how a kd-tree is used in MLlib, but keep in mind that a SchemaRDD is just a normal RDD.

Cheng Hao

From: sanath kumar [mailto:sanath1...@gmail.com]
Sent: Wednesday, October 22, 2014 12:58 PM
To: user@spark.apache.org
Subject: spark sql query optimization, and decision tree building

Hi all,

I have large data in text files (1,000,000 lines). Each line has 128 columns; here each line is a feature and each column is a dimension. I have converted the text files to JSON format and am able to run SQL queries on the JSON files using Spark. Now I am trying to build a k-dimensional decision tree (kd-tree) with this large data.

My steps:
1) Calculate the variance of each column, pick the column with the maximum variance, make it the key of the first node, and the mean of that column the value of the node.
2) Based on the first node's value, split the data into two parts and repeat the process until a stopping point is reached.

My sample code:

import sqlContext._
val people = sqlContext.jsonFile("siftoutput/")
people.printSchema()
people.registerTempTable("people")
val output = sqlContext.sql("SELECT * FROM people")

My questions:
1) How to save the result values of a query into a list?
2) How to calculate the variance of a column? Is there an efficient way?
3) I will be running multiple queries on the same data. Does Spark have any way to optimize this?
4) How to save the output as key-value pairs in a text file?
5) Is there any way I can build a kd decision tree using the machine learning libraries of Spark?

please help

Thanks,
Sanath
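For question 2, the single-pass idea can also be sketched without MLlib, directly on the RDD. The column index is a made-up example, and note this textbook formula is numerically naive compared with the online summarizer recommended elsewhere in the thread:

```scala
// Sketch: one-pass count / sum / sum-of-squares for a single numeric column.
// Assumes `output` is the SchemaRDD above and column index 0 holds a Double.
val col = 0
val (n, sum, sumSq) = output
  .map(_.getDouble(col))
  .aggregate((0L, 0.0, 0.0))(
    { case ((c, s, q), x) => (c + 1, s + x, q + x * x) },                  // fold in one value
    { case ((c1, s1, q1), (c2, s2, q2)) => (c1 + c2, s1 + s2, q1 + q2) })  // merge partitions

val mean = sum / n
val variance = sumSq / n - mean * mean  // population variance
```

Combined with `output.cache()`, repeating this per column only rescans the in-memory data, which addresses questions 2 and 3 together.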
Re: spark sql query optimization, and decision tree building
Thank you very much, two more small questions:

1) val output = sqlContext.sql("SELECT * FROM people") -- my output has 128 columns and a single row. How can I find which column has the maximum value in that single row using Scala?

2) As each row has 128 columns, how can I print each row to a text file with space delimitation, or as JSON, using Scala?

please reply

Thanks,
Sanath

On Wed, Oct 22, 2014 at 8:24 AM, Cheng, Hao hao.ch...@intel.com wrote:

> The "output" variable is actually a SchemaRDD; it provides lots of DSL APIs, see http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
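A sketch for the two follow-up questions, assuming every column is numeric, that SchemaRDD exposes `.schema` (Spark 1.1), and with `rows_txt/` as a made-up output path:

```scala
// 1) Which column holds the maximum value in a single Row?
val row = output.first()
val values = Array.tabulate(128)(i => row.getDouble(i))
val maxIndex = values.indexOf(values.max)            // position of the largest value
val maxColumn = output.schema.fields(maxIndex).name  // its column name

// 2) Write every row space-delimited; in Spark 1.1 a Row is a Seq[Any],
//    so mkString yields one space-separated line per row.
output.map(_.mkString(" ")).saveAsTextFile("rows_txt/")
```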