RE: Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain

Thanks Rezaul…

Does Spark 2.1.0 still have any stability issues?

Regards,
Ankur

From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org]
Sent: Monday, January 09, 2017 5:02 PM
To: Ankur Jain <ankur.j...@yash.com>
Cc: user@spark.apache.org
Subject: Re: Machine Learning in Spark 1.6 vs Spark 2.0

Hello Jain,
I would recommend using Spark MLlib 
<http://spark.apache.org/docs/latest/ml-guide.html> (and ML) from Spark 2.1.0, 
which provides the following features:

  *   ML Algorithms: common learning algorithms such as classification, 
regression, clustering, and collaborative filtering
  *   Featurization: feature extraction, transformation, dimensionality 
reduction, and selection
  *   Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
  *   Persistence: saving and loading algorithms, models, and Pipelines
  *   Utilities: linear algebra, statistics, data handling, etc.
These features will help make your machine learning scalable and easy to use.
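
As a quick illustration, here is a minimal sketch of the Pipelines and
Persistence features in Scala (the column names, parameter values, and save
path below are illustrative only):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Assume `training` is a DataFrame with "text" and "label" columns.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// Pipelines: chain the three stages and fit them as a single estimator.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Persistence: save the fitted pipeline model and reload it later.
model.write.overwrite().save("/tmp/lr-pipeline-model")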


Regards,
_
Md. Rezaul Karim, BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html

On 9 January 2017 at 10:19, Ankur Jain <ankur.j...@yash.com> wrote:
Hi Team,

I want to start a new project with ML, but wanted to know which version of 
Spark is more stable and has more features w.r.t. ML.
Please share your opinion…

Thanks in Advance…


Thanks & Regards
Ankur Jain
Technical Architect – Big Data | IoT | Innovation Group
Board: +91-731-663-6363
Direct: +91-731-663-6125
www.yash.com



Machine Learning in Spark 1.6 vs Spark 2.0

2017-01-09 Thread Ankur Jain
Hi Team,

I want to start a new project with ML, but wanted to know which version of 
Spark is more stable and has more features w.r.t. ML.
Please share your opinion...

Thanks in Advance...


Thanks & Regards
Ankur Jain
Technical Architect - Big Data | IoT | Innovation Group
Board: +91-731-663-6363
Direct: +91-731-663-6125
www.yash.com



RE: Saving Parquet files to S3

2016-06-10 Thread Ankur Jain
Thanks maropu. It worked!

From: Takeshi Yamamuro [mailto:linguin@gmail.com]
Sent: 10 June 2016 11:47 AM
To: Ankur Jain
Cc: user@spark.apache.org
Subject: Re: Saving Parquet files to S3

Hi,

You'd be better off setting `parquet.block.size`.
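
A minimal sketch of setting it in Spark 1.6, assuming `sc` is the
SparkContext and `df` is the DataFrame being written (the 1 GB value and the
S3 path are illustrative):

// parquet.block.size is the Parquet row-group size; it is read from the
// Hadoop configuration when the files are written.
sc.hadoopConfiguration.setInt("parquet.block.size", 1024 * 1024 * 1024)
df.write.parquet("s3://my-bucket/output/")  // hypothetical output path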

// maropu

On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann 
<daniel.siegm...@teamaol.com> wrote:
I don't believe there's any way to output files of a specific size. What you 
can do is partition your data into a number of partitions such that each 
contains around 1 GB.
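
For example, a rough sketch, assuming `df` is the DataFrame and the input is
known to be around 100 GB (both the size estimate and the S3 path are made
up):

// ~100 GB of input / ~1 GB per output file => ~100 partitions.
val numPartitions = 100
df.repartition(numPartitions).write.parquet("s3://my-bucket/output/")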

On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain 
<ankur.j...@yash.com> wrote:
Hello Team,

I want to write Parquet files to AWS S3 and size each file to 1 GB.
Can someone please guide me on how I can achieve the same?

I am using AWS EMR with Spark 1.6.1.

Thanks,
Ankur




--
---
Takeshi Yamamuro


Saving Parquet files to S3

2016-06-09 Thread Ankur Jain
Hello Team,

I want to write Parquet files to AWS S3 and size each file to 1 GB.
Can someone please guide me on how I can achieve the same?

I am using AWS EMR with Spark 1.6.1.

Thanks,
Ankur


dataframe stat corr for multiple columns

2016-05-17 Thread Ankur Jain
Hello Team,

In my current use case I am loading data from CSV using spark-csv and trying 
to correlate all the variables.

As of now, if we want to correlate two columns in a DataFrame, df.stat.corr 
works great, but it won't work for multiple columns at once.
In R we can use corrplot to correlate all numeric columns in a single line of 
code. Can you guide me on how to achieve the same with a DataFrame or SQL?

There seems to be a way in spark-mllib:
http://spark.apache.org/docs/latest/mllib-statistics.html


But it seems that it doesn't take a DataFrame as input...
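
For reference, a sketch of what that API seems to expect, assuming `df` is
the loaded DataFrame with hypothetical numeric (Double) columns "a", "b",
"c": the columns have to be converted to an RDD[Vector] first:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Convert the hypothetical Double columns into mllib vectors.
val vectors = df.select("a", "b", "c").rdd.map { row =>
  Vectors.dense(row.getDouble(0), row.getDouble(1), row.getDouble(2))
}
// Pearson correlation matrix over all three columns at once.
val corrMatrix = Statistics.corr(vectors, "pearson")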

Regards,
Ankur


RE: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Ankur Jain
I had installed Spark via a bootstrap action in EMR.

https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark

However, when I run Spark without YARN (in local mode), it works fine.

Thanks
Ankur

From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com]
Sent: Wednesday, March 25, 2015 7:31 PM
To: Ankur Jain
Cc: user@spark.apache.org
Subject: Re: JavaKinesisWordCountASLYARN Example not working on EMR

Did you build for Kinesis using the profile -Pkinesis-asl?
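(For reference, the Spark docs for that release build the Kinesis examples 
with something like: mvn -Pkinesis-asl -DskipTests clean package.)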

On Wed, Mar 25, 2015 at 7:18 PM, ankur.jain 
<ankur.j...@yash.com> wrote:
Hi,
I am trying to run a Spark on YARN program, provided by Spark in the examples
directory, using Amazon Kinesis on an EMR cluster.
I am using Spark 1.3.0 and EMR AMI 3.5.0.

I've setup the Credentials
export AWS_ACCESS_KEY_ID=XX
export AWS_SECRET_KEY=XXX

A) This is the Kinesis Word Count producer, which ran successfully:
run-example org.apache.spark.examples.streaming.KinesisWordCountProducerASL
mySparkStream https://kinesis.us-east-1.amazonaws.com 1 5

B) This is the normal consumer using Spark Streaming, which also ran
successfully:
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASL
mySparkStream https://kinesis.us-east-1.amazonaws.com

C) And this is the YARN-based program which is NOT working:
run-example org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN
mySparkStream https://kinesis.us-east-1.amazonaws.com
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
15/03/25 11:52:45 INFO spark.SparkContext: Running Spark version 1.3.0
15/03/25 11:52:45 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar').
This is deprecated in Spark 1.0+.
Please instead use:
•   ./spark-submit with --driver-class-path to augment the driver classpath
•   spark.executor.extraClassPath to augment the executor classpath
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:45 WARN spark.SparkConf: Setting
'spark.driver.extraClassPath' to
'/home/hadoop/spark/conf:/home/hadoop/conf:/home/hadoop/spark/classpath/emr/:/home/hadoop/spark/classpath/emrfs/:/home/hadoop/share/hadoop/common/lib/:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar'
as a work-around.
15/03/25 11:52:46 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: Changing modify acls to:
hadoop
15/03/25 11:52:46 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hadoop); users with modify permissions: Set(hadoop)
15/03/25 11:52:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/25 11:52:48 INFO Remoting: Starting remoting
15/03/25 11:52:48 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@ip-10-80-175-92.ec2.internal:59504]
15/03/25 11:52:48 INFO util.Utils: Successfully started service
'sparkDriver' on port 59504.
15/03/25 11:52:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/25 11:52:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/25 11:52:48 INFO storage.DiskBlockManager: Created local directory at
/mnt/spark/spark-120befbc-6dae-4751-b41f-dbf7b3d97616/blockmgr-d339d180-36f5-465f-bda3-cecccb23b1d3
15/03/25 11:52:48 INFO storage.MemoryStore: MemoryStore started with
capacity 265.4 MB
15/03/25 11:52:48 INFO spark.HttpFileServer: HTTP File server directory is
/mnt/spark/spark-85e88478-3dad-4fcf-a43a-efd15166bef3/httpd-6115870a-0d90-44df-aa7c-a6bd1a47e107
15/03/25 11:52:48 INFO spark.HttpServer: Starting HTTP Server
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:44879
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'HTTP file
server' on port 44879.
15/03/25 11:52:49 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/03/25 11:52:49 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/25 11:52:49 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/03/25 11:52:49 INFO util.Utils: Successfully started service 'SparkUI' on
port 4040.
15/03/25 11:52:49 INFO ui.SparkUI: Started SparkUI at
http://ip-10-80-175-92.ec2.internal:4040
15/03/25 11:52:50 INFO spark.SparkContext: Added JAR
file:/home/hadoop/spark/lib/spark-examples-1.3.0-hadoop2.4.0.jar at
http://10.80.175.92:44879/jars/spark-examples-1.3.0-hadoop2.4.0.jar with
timestamp 1427284370358
15/03/25 11:52:50 INFO