Sampling data on RDD vs sampling data on Dataframes
Hello,

my team and I have developed a fairly large big data application using only the DataFrame API (Spark 1.6.3). Since our application uses machine learning for prediction, we need to sample the training dataset so that the data is not skewed. To achieve this we use stratified sampling. As you probably know, DataFrameStatFunctions provides a useful sampleBy method that supposedly carries out stratified sampling based on the fraction map passed as input.

A few questions have arisen:

- The sampleBy method seems to return variable results for the same input data, so it looks more like an *approximate* stratified sampling. Inspection of the Spark source code seems to confirm this hypothesis. The documentation mentions neither the approximation nor a confidence interval that guarantees how good the approximation is supposed to be.
- In the RDD world there is a sampleByKeyExact method, which clearly states that it will produce a sampled dataset with tight guarantees ... is there anything like that in the DataFrame world?

Has anybody in the community worked around these shortcomings of the DataFrame API? I'm well aware that I can get an RDD from a DataFrame, perform sampleByKeyExact and then convert the RDD back to a DataFrame. I'd really like to avoid such a conversion, if possible.

Thank you for any help you can give :)

Best,
Marco
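To make the difference between the two behaviours concrete, here is a minimal plain-Python sketch (no Spark involved). It assumes that sampleBy makes an independent keep/drop decision for each row, which is what reading the source suggests, and that the exact variant keeps ceil(fraction * stratum size) rows per stratum. The function names `bernoulli_sample_by` and `exact_sample_by` and the toy dataset are hypothetical, made up for illustration only.

```python
import math
import random
from collections import Counter, defaultdict


def bernoulli_sample_by(rows, key, fractions, seed):
    """Approximate stratified sampling: one independent coin flip per row.

    Assumption: this mirrors the per-row behaviour of DataFrame sampleBy.
    Strata missing from `fractions` are treated as fraction 0.0.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]


def exact_sample_by(rows, key, fractions, seed):
    """Exact stratified sampling: ceil(fraction * stratum size) rows per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    sample = []
    for k, members in strata.items():
        n = int(math.ceil(fractions.get(k, 0.0) * len(members)))
        sample.extend(rng.sample(members, n))  # sampling without replacement
    return sample


def label(row):
    """Stratum key of a (label, row_id) pair."""
    return row[0]


# Hypothetical skewed training set: 800 rows with label 0, 200 with label 1.
rows = [(0, i) for i in range(800)] + [(1, i) for i in range(200)]
fractions = {0: 0.125, 1: 0.5}  # aims at 100 rows per label

for seed in range(3):
    approx = Counter(label(r) for r in bernoulli_sample_by(rows, label, fractions, seed))
    exact = Counter(label(r) for r in exact_sample_by(rows, label, fractions, seed))
    print(seed, dict(approx), dict(exact))
```

In this sketch the exact variant returns 100 rows per label for every seed, while the per-row variant only hits 100 on average and its counts move with the seed. The price of exactness is that each stratum has to be counted before sampling, which on a cluster would mean extra passes over the data.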
Ipython notebook, ec2 spark cluster and matplotlib
Hello everybody,

I'm running a two-node Spark cluster on EC2, created using the provided scripts. I then ssh into the master and invoke "PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS='notebook --profile=pyspark' spark/bin/pyspark". This launches a Spark notebook that has been configured to listen on all interfaces, not only localhost. I then open my browser and start playing around.

All commands run fine as far as I've seen, but there's an annoying problem: I cannot display matplotlib graphs in a cell. I get the following error: "TclError: no display name and no $DISPLAY environment variable". I've searched the web and tried the following two approaches:

1. Use -X to enable X11 forwarding: with this option I get no error, slow execution and no image at all.
2. Use matplotlib.use('agg'): no image is displayed, but if I execute fig.savefig I can see the image being created.

Has anybody had a similar problem? If so, can you help me troubleshoot?

Thanks,
MD
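For reference, the second approach can be sketched as follows: the figure is drawn off-screen with the Agg backend and returned as PNG bytes, so no X display is needed on the master. The helper name `render_png` and the sample data are hypothetical, and whether the notebook then shows the image depends on how the bytes are handed back to the browser.

```python
import io

import matplotlib

matplotlib.use("Agg")  # select the non-GUI backend before pyplot is imported
import matplotlib.pyplot as plt


def render_png(xs, ys):
    """Draw a line plot off-screen and return it as PNG bytes."""
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")  # write into memory instead of a file
    plt.close(fig)  # release the figure; nothing is sent to a display
    return buf.getvalue()


png = render_png([0, 1, 2, 3], [0, 1, 4, 9])

# In a notebook cell the bytes could then be displayed with:
#   from IPython.display import Image
#   Image(data=png)
```

Depending on the IPython version, running the `%matplotlib inline` magic in the first cell may be a simpler route, since it asks the notebook to embed figures in the page directly.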
Spark MOOC by Berkeley and Databricks
Hello everybody,

in case you missed it, Databricks and Berkeley have announced a free MOOC on Spark and another one on scalable machine learning using Spark. Both courses are free, but if you want a verified certificate of completion you need to donate at least $50. I did it, it's a great investment! Here's the link with all the info: http://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html

Have a nice day.
MD