RE: IOT in Spark

2017-05-19 Thread Lohith Samaga M
Hi Gaurav,
You can process IoT data using Spark. But where will you store the
raw/processed data - Cassandra, Hive, HBase?
You might want to look at a Hadoop cluster for data storage and
processing (Spark on YARN).
For processing streaming data, you might also explore Apache Storm and
Apache Flink.
I suggest doing a POC with each of them and then deciding what works
best for you.
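
To give a concrete picture, here is a rough sketch of the Kafka-to-Spark path using
Spark 2.x Structured Streaming. The broker address, topic name and JSON payload schema
are made-up placeholders, not a reference implementation:

// requires the spark-sql-kafka-0-10 package on the classpath
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object GateSensorStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("GateSensorStream").getOrCreate()
    import spark.implicits._

    // Assumed sensor payload: {"gateId": "...", "direction": "IN"/"OUT", "ts": "..."}
    val schema = new StructType()
      .add("gateId", StringType)
      .add("direction", StringType)
      .add("ts", TimestampType)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
      .option("subscribe", "gate-sensors")                // placeholder topic
      .load()

    // Kafka delivers the payload as binary; cast and parse the JSON value column.
    val events = raw.selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("e"))
      .select("e.*")

    // Count entries/exits per gate in 5-minute windows.
    val counts = events
      .groupBy(window($"ts", "5 minutes"), $"gateId", $"direction")
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")   // swap for a Cassandra/Hive/HBase sink in a real POC
      .start()
      .awaitTermination()
  }
}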

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga





-Original Message-
From: Gaurav1809 [mailto:gauravhpan...@gmail.com] 
Sent: Friday, May 19, 2017 09.28
To: user@spark.apache.org
Subject: IOT in Spark

Hello gurus,

How exactly does this work in real-world scenarios when it comes to reading data from IoT
devices (say, for example, sensors at the in/out gates of a huge mall)? Can we do it in
Spark? Do we need to use any other tool/utility (Kafka?) to read data from
those sensors and then process it in Spark? Please share your thoughts on
this; it will give me a head start on my work. I am completely unaware of the
technology stack that can be used here, so any pointers will be very
helpful. Thanks.

-Gaurav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/IOT-in-Spark-tp28698.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark 2.1.0 with Hive 2.1.1?

2017-05-08 Thread Lohith Samaga M
Hi,
Good day.

My setup:

  1.  Single node Hadoop 2.7.3 on Ubuntu 16.04.
  2.  Hive 2.1.1 with metastore in MySQL.
  3.  Spark 2.1.0 configured using hive-site.xml to use MySQL metastore.
  4.  The VERSION table contains SCHEMA_VERSION = 2.1.0

Hive CLI works fine.
However, when I start spark-shell or spark-sql, Spark resets
SCHEMA_VERSION to 1.2.0.
The Hive CLI then fails to start. After a manual update of the VERSION
table, it works fine again.

I see that the Hive-related jars in the spark/jars directory are of
version 1.2.1.
I tried building Spark from source, but as Spark uses Hive 1.2.1
by default, I get the same set of jars.

How can we make Spark 2.1.0 work with Hive 2.1.1?
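
For reference, the Spark-side knobs that select which Hive client jars Spark uses to talk
to the metastore are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars,
and enabling schema verification on the Hive side makes the client raise an error instead
of silently rewriting the VERSION table. This is a sketch only; whether 2.1.1 is an
accepted value for this particular Spark release needs to be checked against its
documentation, and the jar path is a placeholder:

# spark-defaults.conf (or --conf on spark-submit) -- unverified sketch
spark.sql.hive.metastore.version   2.1.1
spark.sql.hive.metastore.jars      /path/to/apache-hive-2.1.1-bin/lib/*

# hive-site.xml -- ask the metastore client to verify, not modify, the schema
<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>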

Thanks in advance!


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


Block C, Tech Bay, PL Compound, Jeppu Ferry Road, Morgan's gate, Mangalore 575 
001, India
T +91 824 423 1172 Ext. 1172 | CUG Ext. #5651172 |M +91 9880393463 | 
lohith.sam...@mphasis.com
www.mphasis.com




Spark 2.1.0 and Hive 2.1.1

2017-05-03 Thread Lohith Samaga M
Hi,
Good day.

My setup:

  1.  Single node Hadoop 2.7.3 on Ubuntu 16.04.
  2.  Hive 2.1.1 with metastore in MySQL.
  3.  Spark 2.1.0 configured using hive-site.xml to use MySQL metastore.
  4.  The VERSION table contains SCHEMA_VERSION = 2.1.0

Hive CLI works fine.
However, when I start spark-shell or spark-sql, Spark resets
SCHEMA_VERSION to 1.2.0.
The Hive CLI then fails to start. After a manual update of the VERSION
table, it works fine again.

I see that the Hive-related jars in the spark/jars directory are of
version 1.2.1.
I tried building Spark from source, but as Spark uses Hive 1.2.1
by default, I get the same set of jars.

How can we make Spark 2.1.0 work with Hive 2.1.1?

Thanks in advance!

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga




Spark 2 or Spark 1.6.x?

2016-12-11 Thread Lohith Samaga M
Hi,
I am new to Spark and would like to learn it.
I think I should learn version 2.0.2.
Or should I still start with version 1.6.x and then move on to version 2.0.2?

Please advise.

Thanks in advance.

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Lohith Samaga M
Hi,
The AWS CLI already has your access key ID and secret access
key from when you initially configured it.
Is your S3 bucket without any access restrictions?
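
If the standalone driver cannot pick up the instance role for s3:// URLs, one workaround
to try is passing the keys named in the error through Spark's spark.hadoop.* passthrough;
a sketch only, with placeholders in angle brackets for the key values:

spark-submit \
  --class foo \
  --master spark://master-ip:7077 \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3.awsAccessKeyId=<access-key-id> \
  --conf spark.hadoop.fs.s3.awsSecretAccessKey=<secret-access-key> \
  s3://bucket/dir/foo.jar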


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3

Sorry to do this...but... *bump*


From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100
Hello,
I've got a Spark standalone cluster using EC2 instances. I can submit jobs
using "--deploy-mode client"; however, using "--deploy-mode cluster" is proving
to be a challenge. I've tried this:

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
s3://bucket/dir/foo.jar

When I do this, I get:
16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
at 
org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)


Now, I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:

aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar

I'm using Spark 1.6.2 and can't really think of what I can do so that I can
submit the jar from S3 using cluster deploy mode. I've also tried simply
downloading the jar onto a node and spark-submitting that... that works in
client mode, but I get a "not found" error when using cluster mode.

Any help will be appreciated.

Thanks,
Ashic.


Re: migration from Teradata to Spark SQL

2016-05-04 Thread Lohith Samaga M
Hi
Can you look at Apache Drill as a SQL engine on Hive?

Lohith

Sent from my Sony Xperia™ smartphone


 Tapan Upadhyay wrote 

Thank you everyone for guidance.

Jörn, our motivation is to move the bulk of ad hoc queries to Hadoop so that we have
enough bandwidth on our DB for important batch jobs/queries.

For implementing a lambda architecture, is it possible to get real-time
updates from Teradata for any insert/update/delete? DB logs?

Deepak, should we query data from Cassandra using Spark? How will it be
different in terms of performance from storing our data in Hive tables (Parquet)
and querying them using Spark? If there is not much performance gain, why add one
more layer of processing?

Mich, we plan to sync the data using Sqoop hourly/EOD jobs. We have still not decided
how frequently we would need to do that; it will be based on user requirements.
In case they need real-time data, we need to think of an alternative. How are
you doing the same for Sybase? How do you sync in real time?

Thank you!!


Regards,
Tapan Upadhyay
+1 973 652 8757

On Wed, May 4, 2016 at 4:33 AM, Alonso Isidoro Roman wrote:
I agree with Deepak, and I would try to save the data in both Parquet and Avro format if
you can, measure the performance, and choose the best; it will probably
be Parquet, but you have to verify that for yourself.
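
A minimal sketch of such a comparison, assuming a Spark 1.x sqlContext and the Databricks
spark-avro package; the source table name, output paths and package version are
placeholders:

// e.g. started with --packages com.databricks:spark-avro_2.10:2.0.1 (version may differ)
val df = sqlContext.table("some_hive_table")   // hypothetical source (use a HiveContext for Hive tables)

df.write.mode("overwrite").parquet("/tmp/format_test/parquet")
df.write.mode("overwrite")
  .format("com.databricks.spark.avro")
  .save("/tmp/format_test/avro")

// Re-read each copy, run a representative query against it, and time it before choosing.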

Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming must
be the process of putting them in..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"


2016-05-04 9:22 GMT+02:00 Jörn Franke:
Look at lambda architecture.

What is the motivation of your migration?

On 04 May 2016, at 03:29, Tapan Upadhyay wrote:

Hi,

We are planning to move our ad hoc queries from Teradata to Spark. We have a huge
volume of queries during the day. What is the best way to go about it?

1) Read data directly from the Teradata DB using the Spark JDBC data source (a rough sketch follows below the list).

2) Import data using Sqoop EOD jobs into Hive tables stored as Parquet and
then run queries on the Hive tables using Spark SQL or the Spark Hive context.
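
A hedged sketch of option 1, assuming the Teradata JDBC driver jars are on the Spark
classpath; the URL, credentials and table name are placeholders:

val tdDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:teradata://td-host/DATABASE=mydb")   // placeholder host/database
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", "mydb.some_table")                     // placeholder table
  .option("user", "user")
  .option("password", "password")
  .load()

tdDF.registerTempTable("some_table")   // then query it with sqlContext.sql(...)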

Are there any other ways in which we can do this better or more efficiently?

Please guide.

Regards,
Tapan





RE: Why Spark having OutOfMemory Exception?

2016-04-11 Thread Lohith Samaga M
Hi Kramer,
Some options:
1. Store in Cassandra with TTL = 24 hours. When you read the full
table, you get the latest 24 hours of data.
2. Store in Hive as an ORC file and use a timestamp field to filter out the
old data (a rough sketch of this follows below).
3. Try windowing in Spark or Flink (I have not used either).
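
A rough sketch of option 2, assuming an ORC-backed Hive table with an event timestamp
column; the table and column names here are placeholders:

import org.apache.spark.sql.functions._

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._

val all = hiveContext.table("events_orc")                  // placeholder ORC-backed table
// keep only rows whose timestamp falls within the last 24 hours
val last24h = all.filter(
  unix_timestamp($"ts") >= unix_timestamp(current_timestamp()) - 24 * 3600)

last24h.registerTempTable("recent_events")                 // run the report SQL on this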


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


-Original Message-
From: kramer2...@126.com [mailto:kramer2...@126.com] 
Sent: Monday, April 11, 2016 16.18
To: user@spark.apache.org
Subject: Why Spark having OutOfMemory Exception?

I use Spark to do some very simple calculations. The description is like below
(pseudocode):


while True:                        # runs every 5 minutes
    df = read_hdf()                # read HDFS to get a new dataframe every 5 minutes
    my_dict[timestamp] = df        # put the dataframe into a dict keyed by timestamp
    delete_old_dataframe(my_dict)  # delete dataframes whose timestamp is more than 24 hours old
    big_df = merge(my_dict)        # merge the most recent 24 hours of dataframes

To explain:

I have new files coming in every 5 minutes, but I need to generate a report on the
most recent 24 hours of data.
The 24-hour constraint means I need to delete the oldest dataframe every time
I put a new one in.
So I maintain a dict (my_dict in the code above) that maps timestamp -> dataframe.
Every time I put a dataframe into the dict, I go through the dict and delete the old
dataframes whose timestamps are more than 24 hours ago.
After the delete and insert, I merge the dataframes in the dict into one big one and
run SQL on it to get my report.

I want to know if there is anything wrong with this model, because it becomes very slow
after running for a while and hits OutOfMemory. I know that my memory is enough,
and the files are very small for test purposes, so there should not be a memory
problem.

I am wondering if there is a lineage issue, but I am not sure.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Spark-having-OutOfMemory-Exception-tp26743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: append rows to dataframe

2016-03-14 Thread Lohith Samaga M
If all the SQL results have the same set of columns, you could UNION ALL the dataframes:

Create an empty df and union it with each result,
then reassign the new df to the original df before the next union.

Not sure if it is a good idea, but it works; a minimal sketch is below.
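
A minimal sketch of that idea, reusing the names from the quoted code below (Spark 1.x,
where the DataFrame method is unionAll); the empty starting frame is replaced here by
seeding with the first result so the schemas line up:

import org.apache.spark.sql.DataFrame

var dffiltered: DataFrame = null
for (cond <- dfFilterSQLs) {
  val part = dfresult.filter(cond)
                     .select("Col1", "Col2", "Col3", "Col4", "Col5")
  // seed with the first result, then union every subsequent one onto it
  dffiltered = if (dffiltered == null) part else dffiltered.unionAll(part)
}
dffiltered.show()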

Lohith

Sent from my Sony Xperia™ smartphone


 Divya Gehlot wrote 

Hi,

Please bear with me for asking such a naive question.
I have a list of conditions (dynamic SQLs) sitting in an HBase table.
I need to iterate through those dynamic SQLs and add the data to dataframes.
As we know, dataframes are immutable; when I try to iterate in a for loop as shown
below, I get only the last dynamic SQL's result set.

var dffiltered: DataFrame = sqlContext.emptyDataFrame
for (i <- 0 to (dfFilterSQLs.length - 1)) {
  println("Condition=" + dfFilterSQLs(i))
  dffiltered = dfresult.filter(dfFilterSQLs(i))
                       .select("Col1", "Col2", "Col3", "Col4", "Col5")
  dffiltered.show
}


How can I keep appending data to the dataframe and get a final result that contains
all the SQL conditions?

Thanks in advance for the help.

Thanks,
Divya


RE: pass one dataframe column value to another dataframe filter expression + Spark 1.5 + scala

2016-02-05 Thread Lohith Samaga M
Hi,
If you can also format the condition file as a CSV file similar
to the main file, then you can join the two dataframes and select only the required
columns. A rough sketch is below.
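
A rough sketch of that join approach, assuming the condition file is reshaped to carry
literal key values (TagId,year,model) instead of SQL expressions; the reshaped file name
is hypothetical, and the other names come from the quoted post below:

import org.apache.spark.sql.types._

val tagKeysSchema = StructType(Seq(
  StructField("TagId", StringType, true),
  StructField("year", IntegerType, true),
  StructField("model", StringType, true)))

val tagKeys = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(tagKeysSchema)
  .load("/TestDivya/Spark/CarTagKeys.csv")   // hypothetical reshaped condition file

// left outer join keeps untagged cars, with a null TagId
val tagged = dfcars.join(tagKeys,
  dfcars("year") === tagKeys("year") && dfcars("model") === tagKeys("model"),
  "left_outer")

tagged.select(dfcars("year"), dfcars("make"), dfcars("model"),
              dfcars("comment"), dfcars("blank"), tagKeys("TagId")).show()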

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga

From: Divya Gehlot [mailto:divya.htco...@gmail.com]
Sent: Friday, February 05, 2016 13.12
To: user @spark
Subject: pass one dataframe column value to another dataframe filter expression 
+ Spark 1.5 + scala

Hi,
I have two input datasets
First input dataset like as below :

year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt

Second Input dataset :

TagId,condition
1997_cars,year = 1997 and model = 'E350'
2012_cars,year=2012 and model ='S'
2015_cars ,year=2015 and model = 'Volt'

Now my requirement is to read the first dataset and, based on the filtering conditions
in the second dataset, tag the rows of the first dataset by introducing a new
column, TagId, to the first dataset,
so the expected output should look like:

year,make,model,comment,blank,TagId
"2012","Tesla","S","No comment",2012_cars
1997,Ford,E350,"Go get one now they are going fast",1997_cars
2015,Chevy,Volt, ,2015_cars

I tried this:

val sqlContext = new SQLContext(sc)
val carsSchema = StructType(Seq(
StructField("year", IntegerType, true),
StructField("make", StringType, true),
StructField("model", StringType, true),
StructField("comment", StringType, true),
StructField("blank", StringType, true)))

val carTagsSchema = StructType(Seq(
StructField("TagId", StringType, true),
StructField("condition", StringType, true)))


val dfcars = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(carsSchema)
  .load("/TestDivya/Spark/cars.csv")
val dftags = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(carTagsSchema)
  .load("/TestDivya/Spark/CarTags.csv")

val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
val cdtnval = dftags.select("condition")
val df2=dfcars.filter(cdtnval)
<console>:35: error: overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame <and>
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
 cannot be applied to (org.apache.spark.sql.DataFrame)
       val df2=dfcars.filter(cdtnval)

another way :

val col = dftags.col("TagId")
val finaldf = dfcars.withColumn("TagId", col)
org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 
missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project 
[year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];

finaldf.write.format("com.databricks.spark.csv").option("header", 
"true").save("/TestDivya/Spark/carswithtags.csv")


I would really appreciate it if somebody could give me pointers on how I can pass the filter
condition (second dataframe) to the filter function of the first dataframe,
or suggest another solution.
My apologies for such a naive question; I am new to Scala and Spark.

Thanks


RE: Need to user univariate summary stats

2016-02-04 Thread Lohith Samaga M
Hi Arun,
You can do df.agg(max("a"), min("a"), ...) directly on the dataframe, without a groupBy; a small sketch is below.
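
A small sketch, assuming a numeric column named "a" as in the groupBy example from your
mail; the aggregate functions come from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions._

// whole-dataframe aggregation, no grouping key needed
val stats = df.agg(min("a"), max("a"), mean("a"), count("a"))
stats.show()

// df.describe("a") is another quick way to get count/mean/stddev/min/max.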


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga

From: Arunkumar Pillai [mailto:arunkumar1...@gmail.com]
Sent: Thursday, February 04, 2016 14.53
To: user@spark.apache.org
Subject: Need to user univariate summary stats

Hi

I'm currently using the query

sqlContext.sql("SELECT MAX(variablesArray) FROM " + tableName)

to extract mean/max/min.
Is there any better, more optimized way?


In the example I saw df.groupBy("key").agg(skewness("a"), kurtosis("a")),

but I don't have a key anywhere in the data.

How do I extract the univariate summary stats from a df? Please help.

--
Thanks and Regards
Arun