How to get Histogram of all columns in a large CSV / RDD[Array[double]] ?

2015-10-20 Thread DEVAN M.S.
Hi all,


I am trying to calculate a histogram of every column in a CSV file using
Spark (Scala).
I found that DoubleRDDFunctions supports histogram(), so I coded the
following to get a histogram of each column:

1. Get the column count.
2. Create an RDD[Double] for each column and calculate the histogram of each
   RDD using DoubleRDDFunctions.

  val columnIndexArray = Array.tabulate(rdd.first().length)(i => i)
  val histogramData = columnIndexArray.map(columnIndex =>
    rdd.map(row => row(columnIndex)).histogram(6)
  )
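
For reference, here is the same approach as a self-contained sketch starting
from a CSV file (the input path, delimiter, local master, cache() call and
bucket count are illustrative assumptions, not part of the original code):

  import org.apache.spark.{SparkConf, SparkContext}

  // Local master used only to make the sketch runnable standalone.
  val sc = new SparkContext(new SparkConf().setAppName("column-histograms").setMaster("local[*]"))

  // Parse each CSV line into an Array[Double]; cache because every column
  // triggers a separate pass over the data.
  val rdd = sc.textFile("hdfs:///input/data.csv")
    .map(_.split(",").map(_.toDouble))
    .cache()

  val numColumns = rdd.first().length
  // One histogram (6 buckets) per column; each histogram() call runs a job.
  val histogramData = (0 until numColumns).map { columnIndex =>
    rdd.map(row => row(columnIndex)).histogram(6)
  }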

Is this a good way?
Can anyone suggest better ways to tackle this?


Thanks in advance.


SORT BY and ORDER BY file size v/s RAM size

2015-02-28 Thread DEVAN M.S.
Hi devs,

Is there any connection between the input file size and the RAM size when
sorting using Spark SQL?
I tried a 1 GB file with 8 GB RAM and 4 cores and got
java.lang.OutOfMemoryError: GC overhead limit exceeded.
Or could it be for some other reason? It works for other Spark SQL
operations.
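
For context, a minimal sketch of the kind of job involved (the query is taken
from the log below; the HiveContext setup, shuffle-partition setting and
output path are illustrative assumptions, not the original code):

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  // Illustrative only: spark.sql.shuffle.partitions controls how many
  // partitions Spark SQL uses for shuffles; whether it helps this particular
  // sort is an open question.
  hiveContext.setConf("spark.sql.shuffle.partitions", "200")
  val sorted = hiveContext.sql("SELECT * FROM people SORT BY B DESC")
  sorted.saveAsTextFile("hdfs:///output/sorted")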


15/02/28 16:33:03 INFO Utils: Successfully started service 'sparkDriver' on
port 41392.
15/02/28 16:33:03 INFO SparkEnv: Registering MapOutputTracker
15/02/28 16:33:03 INFO SparkEnv: Registering BlockManagerMaster
15/02/28 16:33:03 INFO DiskBlockManager: Created local directory at
/tmp/spark-ecf4d6f0-c526-48fa-bd8a-d74a8bf64820/spark-4865c193-05e6-4aa1-999b-ab8c426479ab
15/02/28 16:33:03 INFO MemoryStore: MemoryStore started with capacity 944.7
MB
15/02/28 16:33:03 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/02/28 16:33:03 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-af545c0b-15e6-4efa-a151-2c73faba8948/spark-987f58b4-5735-4965-91d1-38f238f4bb11
15/02/28 16:33:03 INFO HttpServer: Starting HTTP Server
15/02/28 16:33:03 INFO Utils: Successfully started service 'HTTP file
server' on port 44588.
15/02/28 16:33:08 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
15/02/28 16:33:08 INFO SparkUI: Started SparkUI at http://10.30.9.7:4040
15/02/28 16:33:08 INFO Executor: Starting executor ID driver on host
localhost
15/02/28 16:33:08 INFO AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@10.30.9.7:41392/user/HeartbeatReceiver
15/02/28 16:33:08 INFO NettyBlockTransferService: Server created on 34475
15/02/28 16:33:08 INFO BlockManagerMaster: Trying to register BlockManager
15/02/28 16:33:08 INFO BlockManagerMasterActor: Registering block manager
localhost:34475 with 944.7 MB RAM, BlockManagerId(driver, localhost,
34475)
15/02/28 16:33:08 INFO BlockManagerMaster: Registered BlockManager
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(193213) called with
curMem=0, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 188.7 KB, free 944.5 MB)
15/02/28 16:33:09 INFO MemoryStore: ensureFreeSpace(25432) called with
curMem=193213, maxMem=990550425
15/02/28 16:33:09 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 24.8 KB, free 944.5 MB)
15/02/28 16:33:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on localhost:34475 (size: 24.8 KB, free: 944.6 MB)
15/02/28 16:33:09 INFO BlockManagerMaster: Updated info of block
broadcast_0_piece0
15/02/28 16:33:09 INFO SparkContext: Created broadcast 0 from textFile at
SortSQL.scala:20
15/02/28 16:33:10 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/28 16:33:10 INFO ObjectStore: ObjectStore, initialize called
15/02/28 16:33:10 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
15/02/28 16:33:10 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/28 16:33:12 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes=Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
15/02/28 16:33:12 INFO MetaStoreDirectSql: MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: @
(64), after : .
15/02/28 16:33:13 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MOrder is tagged as
embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as
embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Datastore: The class
org.apache.hadoop.hive.metastore.model.MOrder is tagged as
embedded-only so does not have its own datastore table.
15/02/28 16:33:13 INFO Query: Reading in results for query
org.datanucleus.store.rdbms.query.SQLQuery@0 since the connection used is
closing
15/02/28 16:33:13 INFO ObjectStore: Initialized ObjectStore
15/02/28 16:33:14 INFO HiveMetaStore: Added admin role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: Added public role in metastore
15/02/28 16:33:14 INFO HiveMetaStore: No user is added in admin role, since
config is empty
15/02/28 16:33:14 INFO SessionState: No Tez session required at this point.
hive.execution.engine=mr.
15/02/28 16:33:14 INFO ParseDriver: Parsing command: SELECT * FROM people
SORT BY B DESC
15/02/28 16:33:14 INFO ParseDriver: Parse Completed
15/02/28 16:33:14 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
15/02/28 16:33:14 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
15/02/28 16:33:14 INFO deprecation: 

Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
Thanks Xiangrui Meng, I will try this.

I also found this: https://github.com/kaushikranjan/knnJoin
Will this work with double data? Can we find out the z value of
Vector(10.3, 4.5, 3, 5)?
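
For what it's worth, below is a minimal sketch of the LSH-then-local-kNN idea
Xiangrui describes, on double-valued vectors. The hyperplane count, k and
sample points are illustrative assumptions (this is not the knnJoin project
linked above), and sc is an existing SparkContext.

  import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.2 era)
  import scala.util.Random

  val dim = 4        // vector dimensionality (illustrative)
  val numPlanes = 8  // hash bits per signature (illustrative)
  val k = 3          // neighbours to keep (illustrative)

  // Random hyperplanes define the hash: each bit records which side of a
  // hyperplane a vector falls on, so nearby vectors tend to share buckets.
  val rng = new Random(42)
  val planes = Array.fill(numPlanes, dim)(rng.nextGaussian())
  val bPlanes = sc.broadcast(planes)

  def signature(v: Array[Double]): String =
    bPlanes.value.map { p =>
      if (p.zip(v).map { case (a, b) => a * b }.sum >= 0) '1' else '0'
    }.mkString

  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  val points = sc.parallelize(Seq(
    Array(10.3, 4.5, 3.0, 5.0),   // the example vector from above
    Array(9.8, 4.1, 2.7, 5.2),
    Array(0.1, 0.2, 0.3, 0.4)
  ))

  // Bucket vectors by signature, then brute-force kNN inside each bucket.
  val neighbours = points
    .map(v => (signature(v), v))
    .groupByKey()
    .flatMap { case (_, vectors) =>
      val bucket = vectors.toArray
      bucket.map(v => (v, bucket.filter(_ ne v).sortBy(dist(v, _)).take(k)))
    }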






On Thu, Jan 22, 2015 at 12:25 AM, Xiangrui Meng men...@gmail.com wrote:

 For large datasets, you need hashing in order to compute k-nearest
 neighbors locally. You can start with LSH + k-nearest in Google
 scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui

 On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote:
  Hi all,
 
   Please help me to find the best way to do K-nearest neighbors using Spark
  for large data sets.
 



Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
rdd.coalesce(1) will coalesce the RDD into a single partition and give only
one output file; coalesce(2) will give two, and so on.
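
A minimal sketch of the suggestion (input and output paths are illustrative):

  val rdd = sc.textFile("hdfs:///input/data")
  // coalesce(1) merges everything into a single partition -> one output file.
  rdd.coalesce(1).saveAsTextFile("hdfs:///output/single")
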
On Jan 23, 2015 4:58 AM, Sean Owen so...@cloudera.com wrote:

 One output file is produced per partition. If you want fewer, use
 coalesce() before saving the RDD.

 On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim kane.ist...@gmail.com wrote:
  How can I reduce the number of output files? Is there a parameter to
 saveAsTextFile?
 
  Thanks.
 




Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Can you share your code?


Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao xuelincao2...@gmail.com wrote:


 Hi,

  Yes, this is what I'm doing. I'm using hiveContext.hql() to run my
 query.

   But, the problem still happens.



 On Tue, Jan 20, 2015 at 7:24 PM, DEVAN M.S. msdeva...@gmail.com wrote:

 Add one more library

 libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"


 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

 Replace sqlContext with hiveContext. It's working for me while using
 HiveContext.



 Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
 VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


 On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. msdeva...@gmail.com wrote:

 Which context are you using, HiveContext or SQLContext? Can you try
 with HiveContext?


 Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
 VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


 On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com
 wrote:


 Hi, I'm using Spark 1.2


 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com
 wrote:

  Hi Xuelin,



 What version of Spark are you using?



 Thanks,

 Daoyuan



 *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com]
 *Sent:* Tuesday, January 20, 2015 5:22 PM
 *To:* User
 *Subject:* IF statement doesn't work in Spark-SQL?





 Hi,



   I'm trying to migrate some Hive scripts to Spark-SQL. However, I
 found that some statements are incompatible in Spark-SQL.



   Here is my SQL. And the same SQL works fine in HIVE environment.



 SELECT

   *if(ad_user_id1000, 1000, ad_user_id) as user_id*

 FROM

   ad_search_keywords



  What I found is, the parser reports error on the *if*
 statement:



 No function to evaluate expression. type: AttributeReference, tree:
 ad_user_id#4





  Anyone have any idea about this?












Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Which context are you using, HiveContext or SQLContext? Can you try with
HiveContext?


Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com wrote:


 Hi, I'm using Spark 1.2


 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com
 wrote:

  Hi Xuelin,



 What version of Spark are you using?



 Thanks,

 Daoyuan



 *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com]
 *Sent:* Tuesday, January 20, 2015 5:22 PM
 *To:* User
 *Subject:* IF statement doesn't work in Spark-SQL?





 Hi,



   I'm trying to migrate some Hive scripts to Spark-SQL. However, I
 found that some statements are incompatible in Spark-SQL.



   Here is my SQL. And the same SQL works fine in HIVE environment.



 SELECT

   *if(ad_user_id1000, 1000, ad_user_id) as user_id*

 FROM

   ad_search_keywords



  What I found is, the parser reports error on the *if* statement:



 No function to evaluate expression. type: AttributeReference, tree:
 ad_user_id#4





  Anyone have any idea about this?









Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Add one more library

libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0"


val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

Replace sqlContext with hiveContext. It's working for me while using
HiveContext.
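
A minimal sketch of what this looks like end to end. Note that the comparison
operator in the quoted query appears to have been stripped by the list
archive, so the ">" below is an assumption:

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  // Hive's conditional if(cond, then, else) is available through HiveContext.
  hiveContext.sql(
    "SELECT if(ad_user_id > 1000, 1000, ad_user_id) AS user_id FROM ad_search_keywords"
  ).collect().foreach(println)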



Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


On Tue, Jan 20, 2015 at 4:45 PM, DEVAN M.S. msdeva...@gmail.com wrote:

 Which context are you using, HiveContext or SQLContext? Can you try with
 HiveContext?


 Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA
 VIDYAPEETHAM | Amritapuri | Cell +919946535290 |


 On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao xuelincao2...@gmail.com
 wrote:


 Hi, I'm using Spark 1.2


 On Tue, Jan 20, 2015 at 5:59 PM, Wang, Daoyuan daoyuan.w...@intel.com
 wrote:

  Hi Xuelin,



 What version of Spark are you using?



 Thanks,

 Daoyuan



 *From:* Xuelin Cao [mailto:xuelincao2...@gmail.com]
 *Sent:* Tuesday, January 20, 2015 5:22 PM
 *To:* User
 *Subject:* IF statement doesn't work in Spark-SQL?





 Hi,



   I'm trying to migrate some Hive scripts to Spark-SQL. However, I
 found that some statements are incompatible in Spark-SQL.



   Here is my SQL. And the same SQL works fine in HIVE environment.



 SELECT

   *if(ad_user_id1000, 1000, ad_user_id) as user_id*

 FROM

   ad_search_keywords



  What I found is, the parser reports error on the *if* statement:



 No function to evaluate expression. type: AttributeReference, tree:
 ad_user_id#4





  Anyone have any idea about this?










KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all,

Please help me to find the best way to do K-nearest neighbors using Spark for
large data sets.


How to collect() each partition in scala ?

2014-12-30 Thread DEVAN M.S.
Hi all,
I have one large data set, and when I check the number of partitions it
shows 43.
We can't collect() the whole data set into memory, so I am thinking of
collecting each partition separately so that each piece stays small.

Any thoughts?
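
If it helps, here is a minimal sketch of two ways to bring one partition at a
time to the driver (the input path and partition index are illustrative):

  val rdd = sc.textFile("hdfs:///input/large-data.txt")

  // 1. Stream records to the driver one partition at a time instead of collect().
  rdd.toLocalIterator.foreach(line => println(line))

  // 2. Materialise just one partition by filtering on its index.
  val partitionIndex = 0
  val onePartition = rdd
    .mapPartitionsWithIndex((idx, it) => if (idx == partitionIndex) it else Iterator.empty)
    .collect()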