How to get a histogram of all columns in a large CSV / RDD[Array[Double]]?
Hi all,

I am trying to calculate a histogram of every column of a CSV file using Spark and Scala. I found that DoubleRDDFunctions supports histogram, so I coded the following to get the histogram of all columns:

1. Get the column count.
2. Create an RDD[Double] for each column and calculate its histogram using DoubleRDDFunctions.

    val columnIndexArray = Array.tabulate(rdd.first().length)(identity)
    val histogramData = columnIndexArray.map { column =>
      rdd.map(lines => lines(column)).histogram(6)
    }

Is this a good way? Can anyone suggest a better way to tackle this? Thanks in advance.
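For reference, a self-contained sketch of this approach; the file path, delimiter, and bucket count are made up. One caveat worth noting: every histogram(6) call is a separate job over the whole data set, so caching the parsed RDD avoids re-reading and re-parsing the CSV once per column.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // implicit conversion to DoubleRDDFunctions

    object ColumnHistograms {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ColumnHistograms"))

        // "data.csv" is a placeholder; each line is parsed into an Array[Double].
        val rdd = sc.textFile("data.csv")
          .map(_.split(",").map(_.toDouble))
          .cache() // cached: we make one full pass per column below

        val numColumns = rdd.first().length

        // One 6-bucket histogram per column; each histogram() call runs its own job.
        val histogramData = (0 until numColumns).map { col =>
          rdd.map(row => row(col)).histogram(6)
        }

        histogramData.zipWithIndex.foreach { case ((buckets, counts), col) =>
          println(s"col $col: ${buckets.mkString(",")} / ${counts.mkString(",")}")
        }
      }
    }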
SORT BY and ORDER BY: input file size vs. RAM size
Hi devs,

Is there any connection between the input file size and the RAM size when sorting with Spark SQL? I tried a 1 GB file on a machine with 8 GB RAM and 4 cores and got java.lang.OutOfMemoryError: GC overhead limit exceeded. Or could it be for some other reason? The same setup works for other Spark SQL operations.

15/02/28 16:33:03 INFO Utils: Successfully started service 'sparkDriver' on port 41392.
15/02/28 16:33:03 INFO SparkEnv: Registering MapOutputTracker
15/02/28 16:33:03 INFO SparkEnv: Registering BlockManagerMaster
15/02/28 16:33:03 INFO DiskBlockManager: Created local directory at /tmp/spark-ecf4d6f0-c526-48fa-bd8a-d74a8bf64820/spark-4865c193-05e6-4aa1-999b-ab8c426479ab
15/02/28 16:33:03 INFO MemoryStore: MemoryStore started with capacity 944.7 MB
15/02/28 16:33:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/28 16:33:03 INFO HttpFileServer: HTTP File server directory is /tmp/spark-af545c0b-15e6-4efa-a151-2c73faba8948/spark-987f58b4-5735-4965-91d1-38f238f4bb11
15/02/28 16:33:03 INFO HttpServer: Starting HTTP Server
15/02/28 16:33:03 INFO Utils: Successfully started service 'HTTP file server' on port 44588.
15/02/28 16:33:08 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/28 16:33:08 INFO SparkUI: Started SparkUI at http://10.30.9.7:4040
15/02/28 16:33:08 INFO Executor: Starting executor ID driver on host localhost
15/02/28 16:33:08 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@10.30.9.7:41392/user/HeartbeatReceiver
15/02/28 16:33:08 INFO NettyBlockTransferService: Server created on 34475
15/02/28 16:33:08 INFO BlockManagerMaster: Trying to register BlockManager
15/02/28 16:33:08 INFO BlockManagerMasterActor: Registering block manager localhost:34475 with 944.7 MB RAM, BlockManagerId(driver, localhost, 34475)
15/02/28 16:33:08 INFO BlockManagerMaster: Registered BlockManager
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(193213) called with curMem=0, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 188.7 KB, free 944.5 MB)
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(25432) called with curMem=193213, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.8 KB, free 944.5 MB)
15/02/28 16:33:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:34475 (size: 24.8 KB, free: 944.6 MB)
15/02/28 16:33:09 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/28 16:33:09 INFO SparkContext: Created broadcast 0 from textFile at SortSQL.scala:20
15/02/28 16:33:10 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/28 16:33:10 INFO ObjectStore: ObjectStore, initialize called
15/02/28 16:33:10 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
15/02/28 16:33:10 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/28 16:33:12 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/02/28 16:33:12 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : .
15/02/28 16:33:13 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Query: Reading in results for query org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is closing
15/02/28 16:33:13 INFO ObjectStore: Initialized ObjectStore
15/02/28 16:33:14 INFO HiveMetaStore: Added admin role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: Added public role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: No user is added in admin role, since config is empty
15/02/28 16:33:14 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr.
15/02/28 16:33:14 INFO ParseDriver: Parsing command: SELECT * FROM people SORT BY B DESC
15/02/28 16:33:14 INFO ParseDriver: Parse Completed
15/02/28 16:33:14 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/02/28 16:33:14 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/02/28 16:33:14 INFO deprecation:
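One observation, offered only as an assumption: the MemoryStore line above reports 944.7 MB of capacity, which matches the default ~1 GB driver heap of a local-mode run rather than the machine's 8 GB, so sorting a 1 GB file inside that single JVM can plausibly hit the GC overhead limit. A minimal sketch under that assumption (the paths, table name, and 6g figure are illustrative):

    // Launch with more heap than the ~1 GB default, e.g.:
    //   spark-submit --master "local[4]" --driver-memory 6g --class SortSQL sort-sql.jar
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // SORT BY only sorts within each partition; ORDER BY produces a total order
    // and is the heavier of the two.
    val sorted = hiveContext.sql("SELECT * FROM people SORT BY B DESC")

    // Save to disk rather than collect()-ing ~1 GB of rows into the driver heap.
    sorted.saveAsTextFile("sorted-output") // placeholder output path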
Re: KNN for large data set
Thanks Xiangrui Meng, I will try this. I also found https://github.com/kaushikranjan/knnJoin. Will this work with Double data? Can we find the z-value of Vector(10.3, 4.5, 3, 5)?

On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote:

For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google Scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui

On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote:

Hi all, please help me find the best way to compute k-nearest neighbors with Spark for large data sets.
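To make the LSH pointer concrete, here is a minimal sketch of random-hyperplane LSH bucketing. It illustrates the general technique only, not the knnJoin project's API, and the dimension, plane count, and data are made up. Points whose bit signatures collide fall into the same bucket, so exact k-NN can then be computed locally within each bucket (on Spark, the signature would be the key of a groupByKey).

    import scala.util.Random

    object LshSketch {
      val dim = 4       // vector dimensionality (assumption)
      val numPlanes = 8 // more planes give smaller, more selective buckets

      // One random hyperplane per signature bit.
      val planes: Array[Array[Double]] =
        Array.fill(numPlanes)(Array.fill(dim)(Random.nextGaussian()))

      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => x * y }.sum

      // One bit per hyperplane: which side of the plane the point falls on.
      def signature(v: Array[Double]): Int =
        planes.zipWithIndex.foldLeft(0) { case (sig, (p, i)) =>
          if (dot(p, v) >= 0) sig | (1 << i) else sig
        }

      def main(args: Array[String]): Unit = {
        val points = Array.fill(1000)(Array.fill(dim)(Random.nextDouble()))
        // Nearby points tend to share a signature, so k-NN can be
        // computed locally inside each bucket instead of all-pairs.
        val buckets = points.groupBy(signature)
        println(s"${buckets.size} buckets for ${points.length} points")
      }
    }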
Re: reducing number of output files
Rdd.coalesce(1) will coalesce the RDD and give only one output file; coalesce(2) will give two, and so on.

On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:

One output file is produced per partition. If you want fewer, use coalesce() before saving the RDD.

On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:

How can I reduce the number of output files? Is there a parameter to saveAsTextFile? Thanks.
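For reference, a minimal sketch of that suggestion (the input and output paths are placeholders):

    val rdd = sc.textFile("hdfs:///input/path") // placeholder input
    // coalesce(n) shrinks the RDD to n partitions without a full shuffle,
    // so saveAsTextFile then writes exactly n part-* files.
    rdd.coalesce(1).saveAsTextFile("hdfs:///output/path") // placeholder output

One caveat: coalesce(1) funnels all the data through a single task, so it can be slow for large outputs.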
Re: IF statement doesn't work in Spark-SQL?
Can you share your code?

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi, yes, this is what I'm doing. I'm using hiveContext.hql() to run my query, but the problem still happens.

On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S. msdeva...@gmail.com wrote:

Add one more library:

    libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"

then create

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

and replace sqlContext with hiveContext. It's working with HiveContext for me.

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. msdeva...@gmail.com wrote:

Which context are you using, HiveContext or SQLContext? Can you try with HiveContext?

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi, I'm using Spark 1.2.

On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:

Hi Xuelin, what version of Spark are you using? Thanks, Daoyuan

From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
Subject: IF statement doesn't work in Spark-SQL?

Hi, I'm trying to migrate some Hive scripts to Spark SQL, but I found that some statements are incompatible. Here is my SQL, and the same SQL works fine in the Hive environment:

    SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords

What I found is that the parser reports an error on the if expression:

    No function to evaluate expression. type: AttributeReference, tree: ad_user_id#4

Anyone have any idea about this?
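For reference, a minimal end-to-end sketch of the suggested fix; the case class, sample rows, and comparison are made-up stand-ins for the real ad_search_keywords table, and the Spark 1.2 createSchemaRDD import is assumed:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    case class Keyword(ad_user_id: Int) // made-up stand-in schema

    object IfStatementDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("IfStatementDemo"))
        val hiveContext = new HiveContext(sc)
        import hiveContext.createSchemaRDD // Spark 1.2 implicit: RDD[case class] -> SchemaRDD

        // Made-up sample rows standing in for the real table.
        val rdd = sc.parallelize(Seq(Keyword(500), Keyword(1500), Keyword(2500)))
        rdd.registerTempTable("ad_search_keywords")

        // Hive's if() parses under HiveContext; the plain SQLContext parser rejects it.
        hiveContext.sql(
          "SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords"
        ).collect().foreach(println)
      }
    }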
Re: IF statement doesn't work in Spark-SQL?
Which context are you using, HiveContext or SQLContext? Can you try with HiveContext?

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi, I'm using Spark 1.2.

On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:

Hi Xuelin, what version of Spark are you using? Thanks, Daoyuan

From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
Subject: IF statement doesn't work in Spark-SQL?

Hi, I'm trying to migrate some Hive scripts to Spark SQL, but I found that some statements are incompatible. Here is my SQL, and the same SQL works fine in the Hive environment:

    SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords

What I found is that the parser reports an error on the if expression:

    No function to evaluate expression. type: AttributeReference, tree: ad_user_id#4

Anyone have any idea about this?
Re: IF statement doesn't work in Spark-SQL?
Add one more library:

    libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"

then create

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

and replace sqlContext with hiveContext. It's working with HiveContext for me.

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. msdeva...@gmail.com wrote:

Which context are you using, HiveContext or SQLContext? Can you try with HiveContext?

Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290

On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:

Hi, I'm using Spark 1.2.

On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com wrote:

Hi Xuelin, what version of Spark are you using? Thanks, Daoyuan

From: Xuelin Cao [mailto:xuelincao2...@gmail.com]
Sent: Tuesday, January 20, 2015 5:22 PM
To: User
Subject: IF statement doesn't work in Spark-SQL?

Hi, I'm trying to migrate some Hive scripts to Spark SQL, but I found that some statements are incompatible. Here is my SQL, and the same SQL works fine in the Hive environment:

    SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords

What I found is that the parser reports an error on the if expression:

    No function to evaluate expression. type: AttributeReference, tree: ad_user_id#4

Anyone have any idea about this?
KNN for large data set
Hi all, please help me find the best way to compute k-nearest neighbors with Spark for large data sets.
How to collect() each partition in Scala?
Hi all, I have one large data set, and when I check its number of partitions it shows 43. We can't collect() the whole data set into memory, so I am thinking of collect()-ing each partition separately, so that each piece is small enough. Any thoughts?
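One way to sketch that idea (an illustration, not a tested recommendation): run one job per partition with mapPartitionsWithIndex plus an index check, so each collect() brings only a single partition into driver memory. Note that each pass still schedules a task for every partition, though the non-matching ones return empty iterators and do little work.

    val numPartitions = rdd.partitions.length // 43 in this case
    for (i <- 0 until numPartitions) {
      val singlePartition = rdd.mapPartitionsWithIndex { case (idx, iter) =>
        if (idx == i) iter else Iterator.empty
      }.collect()
      // Process singlePartition locally; only one partition is held at a time.
      println(s"partition $i has ${singlePartition.length} records")
    }

rdd.toLocalIterator, where available, achieves much the same thing by streaming the RDD to the driver one partition at a time.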