Sampling data on RDD vs sampling data on Dataframes

2017-05-21 Thread Marco Didonna
Hello,

my team and I have developed a fairly large big data application using
only the DataFrame API (Spark 1.6.3). Since our application uses machine
learning for prediction, we need to sample the training dataset so that
the data is not skewed.

To achieve this we use stratified sampling. As you all probably know,
DataFrameStatFunctions provides a useful sampleBy method that supposedly
carries out stratified sampling based on the fraction map passed as
input. A few questions have arisen:

- the sampleBy method seems to return variable results for the same
input data, so it looks more like an *approximate* stratified sampling.
Inspection of the Spark source code seems to confirm this hypothesis. The
documentation does not mention the approximation, nor does it give a
confidence interval stating how good the approximation is supposed to be.

- in the RDD world there is a sampleByKeyExact method which clearly states
that it will produce a sampled dataset with tight guarantees ... is there
anything like that in the DataFrame world?
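
To make the difference between the two points above concrete, here is a
minimal pure-Python sketch (no Spark involved). It assumes that sampleBy
keeps each row independently with probability equal to its stratum's
fraction, which is what the source inspection mentioned above suggests,
and that an exact sampler draws ceil(fraction * stratum size) rows per
stratum, as the sampleByKeyExact documentation describes. The function
names, the toy rows and the fractions are made up for illustration only.

```python
import math
import random
from collections import Counter, defaultdict


def bernoulli_sample_by(rows, key, fractions, seed):
    """Approximate: keep each row independently with its stratum's probability.

    Assumption: this mirrors how sampleBy appears to behave; it is not Spark code.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]


def exact_sample_by(rows, key, fractions, seed):
    """Exact: draw ceil(fraction * stratum_size) rows from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    sample = []
    for k, members in strata.items():
        n = math.ceil(fractions.get(k, 0.0) * len(members))
        sample.extend(rng.sample(members, n))
    return sample


def label(row):
    """Stratum key of a toy row: its first field."""
    return row[0]


# Hypothetical toy data: 1000 rows labelled "a" and 100 rows labelled "b".
rows = [("a", i) for i in range(1000)] + [("b", i) for i in range(100)]
fractions = {"a": 0.25, "b": 0.5}

for seed in range(3):
    approx = Counter(label(r) for r in bernoulli_sample_by(rows, label, fractions, seed))
    exact = Counter(label(r) for r in exact_sample_by(rows, label, fractions, seed))
    print(seed, dict(approx), dict(exact))
```

With the approximate version the per-stratum counts change from seed to
seed and only match the requested fractions in expectation; the exact
version always returns the same per-stratum counts, at the cost of
grouping the rows by stratum first.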

Has anybody in the community worked around these shortcomings of the
DataFrame API? I'm very much aware that I can get an RDD from a DataFrame,
perform sampleByKeyExact and then convert the RDD back to a DataFrame. I'd
really like to avoid such a conversion, if possible.

Thank you for any help you people can give :)

Best,

Marco


Ipython notebook, ec2 spark cluster and matplotlib

2015-07-10 Thread Marco Didonna
Hello everybody,
I'm running a two node spark cluster on ec2, created using the provided
scripts. I then ssh into the master and invoke
"PYSPARK_DRIVER_PYTHON=ipython  PYSPARK_DRIVER_PYTHON_OPTS='notebook
--profile=pyspark' spark/bin/pyspark". This launches a spark notebook which
has been instructed to listen to all interfaces, not only localhost. I then
open my browser and start playing around.

All commands run fine as far as I've seen but there's an annoying problem:
I cannot display matplotlib graphs in a cell, I get the following error
"TclError: no display name and no $DISPLAY environment variable".

I've searched the web and I've tried the following two approaches:

1. use -X to enable X11 forwarding: when I use this option I get no error,
a slow execution time and no image at all

2. use matplotlib.use('agg'), no image but if I execute fig.savefig I can
totally see the image being created.
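
For reference, the second approach can be sketched as follows: select the
non-interactive Agg backend, render the figure into an in-memory PNG
instead of a file, and hand the bytes to the notebook for display. This
is only a sketch under the assumption that matplotlib is installed on the
machine running the notebook; the helper name figure_to_png_bytes and the
plotted values are made up for the example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend: needs no $DISPLAY or X server
import matplotlib.pyplot as plt


def figure_to_png_bytes(fig):
    """Render a matplotlib figure to PNG bytes held in memory."""
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    return buf.getvalue()


fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
png_bytes = figure_to_png_bytes(fig)
plt.close(fig)

# In a notebook cell the rendered image can then be shown with:
#   from IPython.display import Image, display
#   display(Image(data=png_bytes))
```

The backend is chosen before pyplot is imported so that no Tk-based
backend is ever loaded on the headless master node.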

Has anybody had a similar problem? If so, can you help me troubleshoot?

Thanks,
MD


Spark MOOC by Berkeley and Databricks

2014-12-03 Thread Marco Didonna
Hello everybody,
in case you missed it, Databricks and Berkeley have announced a free MOOC
on Spark and another one on scalable machine learning using Spark. Both
courses are free, but if you want a verified certificate of completion
you need to donate at least $50. I did it, it's a great investment!

Here's the link with all the info
http://databricks.com/blog/2014/12/02/announcing-two-spark-based-moocs.html

Have a nice day.

MD