How to get Histogram of all columns in a large CSV / RDD[Array[double]] ?

2015-10-20 Thread DEVAN M.S.
Hi all, I am trying to calculate a histogram of every column of a CSV file using Spark with Scala. I found that DoubleRDDFunctions supports histogram, so I wrote the following to get the histogram of all columns: 1. Get the column count 2. Create an RDD[Double] for each column and calculate the histogram of
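
A minimal sketch of the per-column approach outlined in the message above, assuming the CSV has already been parsed into an RDD[Array[Double]] in which every row has the same number of columns. The names ColumnHistograms, columnHistograms and buckets are hypothetical and not taken from the thread; histogram(bucketCount) is the DoubleRDDFunctions method the message refers to.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit DoubleRDDFunctions on Spark 1.2 and earlier
import org.apache.spark.rdd.RDD

object ColumnHistograms {
  // Returns one (bucket edges, bucket counts) pair per column.
  // Assumption: every row has as many columns as the first row.
  def columnHistograms(rows: RDD[Array[Double]],
                       buckets: Int): Array[(Array[Double], Array[Long])] = {
    rows.cache() // the data is scanned again for every column
    val numCols = rows.first().length
    (0 until numCols).map { c =>
      // Project column c to an RDD[Double], then bucket it evenly between its min and max.
      rows.map(row => row(c)).histogram(buckets)
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("column-histograms").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Tiny in-memory stand-in for the parsed CSV.
      val rows = sc.parallelize(Seq(
        Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 30.0), Array(4.0, 40.0)))
      columnHistograms(rows, 2).zipWithIndex.foreach { case ((edges, counts), c) =>
        println(s"column $c: edges=${edges.mkString(",")} counts=${counts.mkString(",")}")
      }
    } finally sc.stop()
  }
}
```

This launches separate Spark jobs for each column, which is why the sketch caches the input; for very wide files a single pass that aggregates all columns at once may be preferable.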

SORT BY and ORDER BY file size v/s RAM size

2015-02-28 Thread DEVAN M.S.
Hi devs, Is there any connection between the input file size and the RAM size when sorting with SparkSQL? I tried a 1 GB file with 8 GB RAM and 4 cores and got java.lang.OutOfMemoryError: GC overhead limit exceeded. Or could it be for some other reason? It's working for other SparkSQL

Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me to find out best way for K

Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
rdd.coalesce(1) will coalesce the RDD and give only one output file; coalesce(2) will give two, and so on. On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote: One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD. On Thu, Jan 22, 2015 at 10:46
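
A small sketch of the advice in this reply, assuming an RDD[String] saved with saveAsTextFile in local mode. The names CoalesceOutput and saveWithFewerFiles, and the temporary output directory, are hypothetical and only there to make the example runnable.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoalesceOutput {
  // One part file is written per partition, so reduce the partition count before saving.
  // Without a shuffle, coalesce can only lower the partition count, never raise it.
  def saveWithFewerFiles(lines: RDD[String], numFiles: Int, path: String): Unit =
    lines.coalesce(numFiles).saveAsTextFile(path)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-output").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize((1 to 1000).map(_.toString), 8) // 8 partitions
      // saveAsTextFile requires that the target directory does not exist yet.
      val out = Files.createTempDirectory("coalesce-demo").resolve("out").toString
      saveWithFewerFiles(lines, 2, out)
      println(s"saved to $out")
    } finally sc.stop()
  }
}
```

Note that coalesce(1) funnels all the data through a single task, so it is only sensible when the output is small enough for one executor to write.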

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Can you share your code ? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, Yes, this is what I'm doing. I'm using hiveContext.hql() to run

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Which context are you using, HiveContext or SQLContext? Can you try with HiveContext? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'm using

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Add one more library: libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) Replace sqlContext with hiveContext. It's working for me when using HiveContext. Devan M.S. | Research Associate | Cyber Security | AMRITA

KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all, Please help me find the best way to do K-nearest neighbors with Spark on large data sets.

How to collect() each partition in scala ?

2014-12-30 Thread DEVAN M.S.
Hi all, I have one large data set. When I check the number of partitions, it shows 43. We can't collect() the whole data set into memory, so I am thinking of calling collect() on each partition separately so that each result is small. Any thoughts?
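
One possible way to do this, sketched under the assumption that any single partition fits in driver memory. The names PerPartitionCollect and collectPartition are hypothetical; mapPartitionsWithIndex, collect and toLocalIterator are standard RDD methods.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerPartitionCollect {
  // Bring only partition `index` to the driver: every other partition
  // contributes an empty iterator, so its elements are never materialised.
  def collectPartition[T: ClassTag](rdd: RDD[T], index: Int): Array[T] =
    rdd.mapPartitionsWithIndex(
      (i, it) => if (i == index) it else Iterator.empty,
      preservesPartitioning = true
    ).collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("per-partition-collect").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, 4)
      // Process one partition at a time on the driver.
      for (i <- 0 until rdd.partitions.length) {
        val chunk = collectPartition(rdd, i)
        println(s"partition $i holds ${chunk.length} elements")
      }
      // Alternative: toLocalIterator streams the RDD to the driver
      // partition by partition, holding one partition in memory at a time.
      rdd.toLocalIterator.take(5).foreach(println)
    } finally sc.stop()
  }
}
```

If the per-partition work does not actually need to happen on the driver, foreachPartition or mapPartitions keeps the processing on the executors and avoids the transfer altogether.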