RE: IOT in Spark

2017-05-19 Thread Lohith Samaga M
Hi Gaurav, You can process IoT data using Spark. But where will you store the raw/processed data - Cassandra, Hive, HBase? You might want to look at a Hadoop cluster for data storage and processing (Spark on YARN). For processing streaming data, you might also
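
A minimal Structured Streaming sketch of the kind of pipeline discussed in this thread, reading IoT events from Kafka and landing them as Parquet on HDFS. The broker address, topic name and paths are placeholders rather than details from the thread, and the Kafka source assumes the spark-sql-kafka-0-10 package is on the classpath.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("IotIngest")
      .getOrCreate()

    // Read raw IoT events from Kafka (hypothetical broker and topic).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "iot-events")
      .load()

    // Keep the payload as a string; a real pipeline would parse JSON here.
    val events = raw.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Persist to HDFS as Parquet; Cassandra/HBase sinks need their own connectors.
    val query = events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/iot/raw")
      .option("checkpointLocation", "hdfs:///checkpoints/iot-raw")
      .start()

    query.awaitTermination()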

Spark 2.1.0 with Hive 2.1.1?

2017-05-08 Thread Lohith Samaga M
Hi, Good day. My setup: 1. Single node Hadoop 2.7.3 on Ubuntu 16.04. 2. Hive 2.1.1 with metastore in MySQL. 3. Spark 2.1.0 configured using hive-site.xml to use MySQL metastore. 4. The VERSION table contains SCHEMA_VERSION = 2.1.0 Hive
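
As an illustration of this setup (not the poster's actual code), a minimal sketch of pointing Spark 2.1.0 at an existing Hive metastore, assuming hive-site.xml sits in $SPARK_HOME/conf:

    import org.apache.spark.sql.SparkSession

    // hive-site.xml in $SPARK_HOME/conf supplies the MySQL metastore connection.
    val spark = SparkSession.builder()
      .appName("HiveMetastoreCheck")
      .enableHiveSupport()
      .getOrCreate()

    // If the metastore is reachable and the schema versions are compatible,
    // the existing Hive databases and tables should be visible here.
    spark.sql("SHOW DATABASES").show()
    spark.sql("SHOW TABLES IN default").show()

Note that Spark 2.1.0 uses a Hive 1.2.1 metastore client by default, so talking to a metastore whose VERSION table reports 2.1.0 may require setting spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars to matching values.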

Spark 2.1.0 and Hive 2.1.1

2017-05-03 Thread Lohith Samaga M
Hi, Good day. My setup: 1. Single node Hadoop 2.7.3 on Ubuntu 16.04. 2. Hive 2.1.1 with metastore in MySQL. 3. Spark 2.1.0 configured using hive-site.xml to use MySQL metastore. 4. The VERSION table contains SCHEMA_VERSION = 2.1.0 Hive

Spark 2 or Spark 1.6.x?

2016-12-11 Thread Lohith Samaga M
Hi, I am new to Spark. I would like to learn Spark. I think I should learn version 2.0.2. Or should I still go for version 1.6.x and then come to version 2.0.2? Please advise. Thanks in advance. Best regards / Mit freundlichen Grüßen / Sincères salutations M.

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Lohith Samaga M
Hi, The aws CLI already has your access key ID and secret access key from when you initially configured it. Is your S3 bucket free of access restrictions? Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab

Re: migration from Teradata to Spark SQL

2016-05-04 Thread Lohith Samaga M
Hi, Can you look at Apache Drill as a SQL engine on Hive? Lohith Sent from my Sony Xperia™ smartphone Tapan Upadhyay wrote: Thank you everyone for the guidance. Jorn, our motivation is to move the bulk of ad-hoc queries to Hadoop so that we have enough bandwidth on our DB for important batch queries.

RE: Why Spark having OutOfMemory Exception?

2016-04-11 Thread Lohith Samaga M
Hi Kramer, Some options: 1. Store in Cassandra with TTL = 24 hours. When you read the full table, you get the latest 24 hours of data. 2. Store in Hive as an ORC file and use a timestamp field to filter out the old data. 3. Try windowing in Spark or Flink (have not used
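
A rough sketch of option 2, assuming a Hive table named events stored as ORC with an event_ts timestamp column (both names invented for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, expr}

    val spark = SparkSession.builder()
      .appName("Last24Hours")
      .enableHiveSupport()
      .getOrCreate()

    // Keep only rows from the last 24 hours; with ORC, the timestamp predicate
    // can be pushed down so old stripes are skipped rather than loaded.
    val last24h = spark.table("events")
      .filter(col("event_ts") >= expr("current_timestamp() - INTERVAL 24 HOURS"))

    last24h.show()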

RE: append rows to dataframe

2016-03-14 Thread Lohith Samaga M
If all SQL results have the same set of columns, you could UNION all the dataframes: create an empty df and union all, then reassign the new df to the original df before the next union all. Not sure if it is a good idea, but it works. Lohith Sent from my Sony Xperia™ smartphone Divya Gehlot wrote: Hi,
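
A small sketch of the union-and-reassign pattern described here, with invented query strings; it uses the Spark 2.x DataFrame API, where the method is union (on 1.5/1.6 it was unionAll):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder()
      .appName("UnionResults")
      .getOrCreate()

    // Hypothetical queries; every result must have the same set of columns.
    val queries = Seq(
      "SELECT id, name FROM table_a",
      "SELECT id, name FROM table_b",
      "SELECT id, name FROM table_c"
    )

    // Start from the first result, then union in the rest,
    // reassigning the combined DataFrame each time.
    var combined: DataFrame = spark.sql(queries.head)
    for (q <- queries.tail) {
      combined = combined.union(spark.sql(q))
    }

    combined.show()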

RE: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Lohith Samaga M
Hi, If you can also format the condition file as a CSV file similar to the main file, then you can join the two dataframes and select only the required columns. Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Divya Gehlot
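
To illustrate the join approach, a sketch that reads both files as CSV and joins on a shared key. The paths and column names are invented, and the built-in CSV reader shown is the Spark 2.x one (on Spark 1.5 the spark-csv package would be needed):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ConditionJoin")
      .getOrCreate()

    // Hypothetical files: the main data set and the condition file, both with headers.
    val main = spark.read.option("header", "true").csv("/data/main.csv")
    val conditions = spark.read.option("header", "true").csv("/data/conditions.csv")

    // Join on a shared key column and keep only the columns of interest.
    val joined = main
      .join(conditions, Seq("account_id"))
      .select("account_id", "amount", "threshold")

    joined.show()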

RE: Need to user univariate summary stats

2016-02-04 Thread Lohith Samaga M
Hi Arun, You can do df.agg(max(...), min(...)). Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Arunkumar Pillai [mailto:arunkumar1...@gmail.com] Sent: Thursday, February 04, 2016 14.53 To: user@spark.apache.org Subject: Need to user univariate
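
A concrete version of that suggestion, assuming a DataFrame df with a numeric column named value (the column name is invented for illustration):

    import org.apache.spark.sql.functions.{max, mean, min, stddev}

    // Univariate summary statistics for a single numeric column.
    val stats = df.agg(
      min("value").alias("min_value"),
      max("value").alias("max_value"),
      mean("value").alias("mean_value"),
      stddev("value").alias("stddev_value")
    )
    stats.show()

df.describe("value").show() returns count, mean, stddev, min and max in one call as well.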