from:"Marco Didonna"

Sampling data on RDD vs sampling data on Dataframes

2017-05-21 Thread Marco Didonna

Hello, my team and I have developed a fairly large big data application using only the DataFrame API (Spark 1.6.3). Since our application uses machine learning for prediction, we need to sample the training dataset so that the data is not skewed. To achieve this objective we use stratified sampl

Ipython notebook, ec2 spark cluster and matplotlib

2015-07-10 Thread Marco Didonna

Hello everybody, I'm running a two-node Spark cluster on EC2, created using the provided scripts. I then ssh into the master and invoke "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --profile=pyspark' spark/bin/pyspark". This launches a Spark notebook which has been instructe

Spark MOOC by Berkeley and Databricks

2014-12-03 Thread Marco Didonna

Hello everybody, in case you missed it, Databricks and Berkeley have announced a free MOOC on Spark and another one on scalable machine learning using Spark. Both courses are free, but if you want a verified certificate of completion you need to donate at least $50. I did it, it's a great invest

3 matches
