Hi all,
I am trying to calculate a histogram of all columns from a CSV file using
Spark Scala.
I found that DoubleRDDFunctions supports histogram().
So I coded the following to get the histogram of all columns:
1. Get column count
2. Create an RDD[Double] for each column and calculate the histogram of
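The two steps above can be sketched roughly as follows (a minimal sketch, assuming an existing SparkContext `sc`, a numeric-only CSV at the hypothetical path "data.csv", and 10 buckets per column):

```scala
// Split each CSV line into fields.
val rows = sc.textFile("data.csv").map(_.split(","))

// Step 1: get the column count from the first row.
val numCols = rows.first().length

// Step 2: build an RDD[Double] per column and call
// DoubleRDDFunctions.histogram on it.
val histograms = (0 until numCols).map { i =>
  val col = rows.map(r => r(i).toDouble) // RDD[Double] for column i
  col.histogram(10)                      // (bucket boundaries, counts)
}

histograms.zipWithIndex.foreach { case ((buckets, counts), i) =>
  println(s"column $i: buckets=${buckets.mkString(",")} counts=${counts.mkString(",")}")
}
```

Note this makes one pass over the data per column; caching `rows` first would avoid re-reading the file each time.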
Hi devs,
Is there any connection between the input file size and RAM size for
sorting using SparkSQL?
I tried a 1 GB file with 8 GB RAM and 4 cores and got
java.lang.OutOfMemoryError: GC overhead limit exceeded.
Or could it be for any other reason? It's working for other SparkSQL
hashing in order to compute k-nearest
neighbors locally. You can start with LSH + k-nearest in Google
Scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui
On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote:
Hi all,
Please help me to find out best way for K
rdd.coalesce(1) will coalesce the RDD and give only one output file;
coalesce(2) will give 2, and so on.
On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:
One output file is produced per partition. If you want fewer, use
coalesce() before saving the RDD.
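A minimal sketch of the suggestion above (the input and output paths are hypothetical):

```scala
// Read an RDD, then shrink it to a single partition before saving,
// so saveAsTextFile writes exactly one part file.
val rdd = sc.textFile("hdfs:///input")
rdd.coalesce(1).saveAsTextFile("hdfs:///output-single")
```

Keep in mind coalesce(1) funnels all data through one task, so it is only advisable when the result is small enough for a single executor to write.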
On Thu, Jan 22, 2015 at 10:46
Can you share your code?
Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
Yes, this is what I'm doing. I'm using hiveContext.hql() to run
Which context are you using, HiveContext or SQLContext? Can you try
with HiveContext?
Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |
On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi, I'm using
Add one more library
libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Replace sqlContext with hiveContext. It's working for me while using
HiveContext.
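Putting the suggestion together, a minimal sketch (the table name and query are hypothetical):

```scala
// Create a HiveContext from the existing SparkContext and run a
// HiveQL query through it instead of through sqlContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val result = hiveContext.sql("SELECT key, value FROM src ORDER BY key")
result.collect().foreach(println)
```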
Devan M.S. | Research Associate | Cyber Security | AMRITA
Hi all,
Please help me to find out best way for K-nearest neighbor using spark for
large data sets.
Hi all,
I have one large data set. When I get the number of partitions, it
shows 43.
We can't collect() the large data set into memory, so I am thinking like
this: collect() each partition separately, so that each piece is small in size.
Any thoughts ?
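One way to sketch the per-partition idea above without collect() is RDD.toLocalIterator, which pulls only one partition's data to the driver at a time (the path is hypothetical):

```scala
// Stream the RDD to the driver one partition at a time, so driver
// memory only ever holds a single partition's worth of data.
val rdd = sc.textFile("hdfs:///large-dataset")
rdd.toLocalIterator.foreach { line =>
  // process one record at a time on the driver
  println(line)
}
```

Note toLocalIterator triggers one job per partition, so it trades memory for extra scheduling overhead.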