Re: Scala, Python or Java for Spark programming

2017-06-07 Thread Matt Tenenbaum
A lot depends on your context as well. If I'm using Spark _for analysis_, I
frequently use Python; it's a starting point from which I can then
leverage pandas, matplotlib/seaborn, and other powerful tools available on
top of Python.

If the Spark outputs are the ends themselves, rather than the means to
further exploration, Scala still feels like the "first class"
language---most thorough feature set, best debugging support, etc.

More crudely: if the eventual goal is a dataset, I tend to prefer Scala; if
it's a visualization or some summary values, I tend to prefer Python.

Of course, I also agree that this is more theological than technical.
Appropriately size your grains of salt.


On Wed, Jun 7, 2017 at 12:39 PM, Bryan Jeffrey wrote:

> Mich,
> We use Scala for a large project.  On our team we've set a few standards
> to ensure readability (we try to avoid excessive use of tuples, use named
> functions, etc.)  Given these constraints, I find Scala to be very
> readable, and far easier to use than Java.  The lambda functionality of
> Java provides a lot of similar features, but the amount of typing required
> to set down a small function is excessive at best!
> Regards,
> Bryan Jeffrey
> On Wed, Jun 7, 2017 at 12:51 PM, Jörn Franke wrote:
>> I think this is a religious question ;-)
>> Java is often underestimated, because people are not aware of its lambda
>> functionality, which makes the code very readable. With Scala, it depends
>> on who writes it. People coming from a conventional Java background write
>> Java-like code in Scala, which might not be so good. People from a
>> functional background write it in a more functional style, i.e. you get a
>> lot of things in one line of code, which can be a curse even for other
>> functional programmers, especially if the application is distributed as in
>> the case of Spark. Usually no comments are provided and you have, even as
>> a functional programmer, to do a lot of drilling down. Python is somewhat
>> similar, but since it has no connection with Java you do not have these
>> extremes. There it depends more on the community (e.g. medical, financial)
>> and the skills of the people what the code looks like.
>> However, the difficulty comes with the distributed applications behind
>> Spark, which may have unforeseen side effects if the users are not aware
>> of this, i.e. if they have never been used to parallel programming.
>> On 7. Jun 2017, at 17:20, Mich Talebzadeh wrote:
>> Hi,
>> I am a fan of Scala and functional programming hence I prefer Scala.
>> I had a discussion with a hardcore Java programmer and a data scientist
>> who prefers Python.
>> Their view is that in collaborative work using Scala it is
>> almost impossible to understand someone else's Scala code.
>> Hence I was wondering how much truth there is in this statement. Given
>> that Spark uses Scala as its core development language, what is the general
>> view on the use of Scala, Python or Java?
>> Thanks,
>> Dr Mich Talebzadeh

Re: spark-shell with different username

2016-04-02 Thread Matt Tenenbaum
Hi Mich. I certainly should have included that info in my original message
(sorry!): it's a Mac, running OS X (10.11.3).


On Fri, Apr 1, 2016 at 11:16 PM, Mich Talebzadeh wrote:

> Matt,
> What OS are you using on your laptop? Sounds like Ubuntu or something?
> Thanks
> Dr Mich Talebzadeh
> On 2 April 2016 at 01:17, Matt Tenenbaum wrote:
>> Hello all —
>> tl;dr: I’m having an issue running spark-shell from my laptop (or other
>> non-cluster-affiliated machine), and I think the issue boils down to
>> usernames. Can I convince spark/scala that I’m someone other than $USER?
>> A bit of background: our cluster is CDH 5.4.8, installed with Cloudera
>> Manager 5.5. We use LDAP, and my login on all hadoop-affiliated machines
>> (including the gateway boxes we use for running scheduled work) is
>> ‘matt.tenenbaum’. When I run spark-shell on one of those machines,
>> everything is fine:
>> [matt.tenenbaum@remote-machine ~]$ HADOOP_CONF_DIR=/etc/hadoop/conf 
>> SPARK_HOME=spark-1.6.0-bin-hadoop2.6 
>> spark-1.6.0-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client
>> Everything starts up correctly, I get a scala prompt, the SparkContext
>> and SQL context are correctly initialized, and I’m off to the races:
>> 16/04/01 23:27:00 INFO session.SessionState: Created local directory: 
>> /tmp/35b58974-dad5-43c6-9864-43815d101ca0_resources
>> 16/04/01 23:27:00 INFO session.SessionState: Created HDFS directory: 
>> /tmp/hive/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0
>> 16/04/01 23:27:00 INFO session.SessionState: Created local directory: 
>> /tmp/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0
>> 16/04/01 23:27:00 INFO session.SessionState: Created HDFS directory: 
>> /tmp/hive/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0/_tmp_space.db
>> 16/04/01 23:27:00 INFO repl.SparkILoop: Created sql context (with Hive 
>> support)..
>> SQL context available as sqlContext.
>> scala> 1 + 41
>> res0: Int = 42
>> scala> sc
>> res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4e9bd2c8
>> I am running 1.6 from a downloaded tgz file, rather than the spark-shell
>> made available to the cluster from CDH. I can copy that tgz to my laptop,
>> and grab a copy of the cluster configurations, and in a perfect world I
>> would then be able to run everything in the same way
>> [matt@laptop ~]$ HADOOP_CONF_DIR=path/to/hadoop/conf 
>> SPARK_HOME=spark-1.6.0-bin-hadoop2.6 
>> spark-1.6.0-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client
>> Notice there are two things that are different:
>>    1. My local username on my laptop is ‘matt’, which does not match my
>>    name on the remote machine.
>>    2. The Hadoop configs live somewhere other than /etc/hadoop/conf
>> Alas, #1 proves fatal because of cluster permissions (there is no
>> /user/matt/ in HDFS, and ‘matt’ is not a valid LDAP user). In the
>> initialization logging output, I can see that fail in an expected way:
>> 16/04/01 16:37:19 INFO yarn.Client: Setting up container launch context for 
>> our AM
>> 16/04/01 16:37:19 INFO yarn.Client: Setting up the launch environment for 
>> our AM container
>> 16/04/01 16:37:19 INFO yarn.Client: Preparing resources for our AM container
>> 16/04/01 16:37:20 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/04/01 16:37:21 ERROR spark.SparkContext: Error initializing SparkContext.
>> Permission denied: 
>> user=matt, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
>> at 
>> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(
>> at 
>> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(
>> at (... etc ...)
>> Fine. In other circumstances I’ve told Hadoop explicitly who I am by
>> setting HADOOP_USER_NAME. Maybe that works here?
>> [matt@laptop ~]$ HADOOP_USER_NAME=matt.tenenbaum HADOOP_CONF_DIR=s

spark-shell with different username

2016-04-01 Thread Matt Tenenbaum
Hello all —

tl;dr: I’m having an issue running spark-shell from my laptop (or other
non-cluster-affiliated machine), and I think the issue boils down to
usernames. Can I convince spark/scala that I’m someone other than $USER?

A bit of background: our cluster is CDH 5.4.8, installed with Cloudera
Manager 5.5. We use LDAP, and my login on all hadoop-affiliated machines
(including the gateway boxes we use for running scheduled work) is
‘matt.tenenbaum’. When I run spark-shell on one of those machines,
everything is fine:

[matt.tenenbaum@remote-machine ~]$ HADOOP_CONF_DIR=/etc/hadoop/conf \
    SPARK_HOME=spark-1.6.0-bin-hadoop2.6 \
    spark-1.6.0-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client

Everything starts up correctly, I get a scala prompt, the SparkContext and
SQL context are correctly initialized, and I’m off to the races:

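16/04/01 23:27:00 INFO session.SessionState: Created local directory:
/tmp/35b58974-dad5-43c6-9864-43815d101ca0_resources
16/04/01 23:27:00 INFO session.SessionState: Created HDFS directory:
/tmp/hive/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0
16/04/01 23:27:00 INFO session.SessionState: Created local directory:
/tmp/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0
16/04/01 23:27:00 INFO session.SessionState: Created HDFS directory:
/tmp/hive/matt.tenenbaum/35b58974-dad5-43c6-9864-43815d101ca0/_tmp_space.db
16/04/01 23:27:00 INFO repl.SparkILoop: Created sql context (with Hive
support)..
SQL context available as sqlContext.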

scala> 1 + 41
res0: Int = 42

scala> sc
res1: org.apache.spark.SparkContext = org.apache.spark.SparkContext@4e9bd2c8

I am running 1.6 from a downloaded tgz file, rather than the spark-shell
made available to the cluster from CDH. I can copy that tgz to my laptop,
and grab a copy of the cluster configurations, and in a perfect world I
would then be able to run everything in the same way:

[matt@laptop ~]$ HADOOP_CONF_DIR=path/to/hadoop/conf \
    SPARK_HOME=spark-1.6.0-bin-hadoop2.6 \
    spark-1.6.0-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client

Notice there are two things that are different:

   1. My local username on my laptop is ‘matt’, which does not match my
   name on the remote machine.
   2. The Hadoop configs live somewhere other than /etc/hadoop/conf
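
As a rough illustration of those two differences, the sketch below compares a local login with the cluster login and checks that a config directory looks usable. The helper name check_launch_env is hypothetical, and treating the presence of yarn-site.xml as the sign of a usable config directory is an assumption, not something the cluster setup described here guarantees.

```shell
# check_launch_env LOCAL_USER CLUSTER_USER CONF_DIR
# Hypothetical helper: prints one line per difference that might matter.
check_launch_env() {
    if [ "$1" != "$2" ]; then
        echo "user mismatch: local '$1', cluster '$2'"
    fi
    # Assumption: a usable Hadoop config directory contains yarn-site.xml.
    if [ ! -f "$3/yarn-site.xml" ]; then
        echo "no yarn-site.xml under '$3'"
    fi
}

# Values taken from the commands in this message; id -un prints the local login.
check_launch_env "$(id -un)" "matt.tenenbaum" "path/to/hadoop/conf"
```

If both checks pass, the function prints nothing.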

Alas, #1 proves fatal because of cluster permissions (there is no
/user/matt/ in HDFS, and ‘matt’ is not a valid LDAP user). In the
initialization logging output, I can see that fail in an expected way:

16/04/01 16:37:19 INFO yarn.Client: Setting up container launch
context for our AM
16/04/01 16:37:19 INFO yarn.Client: Setting up the launch environment
for our AM container
16/04/01 16:37:19 INFO yarn.Client: Preparing resources for our AM container
16/04/01 16:37:20 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
16/04/01 16:37:21 ERROR spark.SparkContext: Error initializing SparkContext. Permission denied:
user=matt, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
at (... etc ...)
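
The inode field in that error can be read as path, owner, group, and mode. The sketch below splits the string copied from the log and looks at the permission bits that would apply to 'matt'. It assumes 'matt' is neither the owner nor a member of supergroup, which the log does not state, and it ignores ACLs and superuser rules.

```shell
# Field copied from the "Permission denied" line: "path":owner:group:mode
inode='"/user":hdfs:supergroup:drwxr-xr-x'

# Split on ':' into the four parts.
IFS=: read -r path owner group mode <<EOF
$inode
EOF

# Characters 8-10 of the mode string are the bits for "other" users.
# Assumption: user 'matt' is not 'hdfs' and not in 'supergroup'.
other_bits=$(printf '%s' "$mode" | cut -c8-10)

case $other_bits in
    *w*) verdict="write allowed" ;;
    *)   verdict="write denied" ;;
esac
echo "$path owner=$owner group=$group other=$other_bits: $verdict"
```

Under that assumption, the "other" bits carry no write permission, which would be consistent with the access=WRITE failure on /user.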

Fine. In other circumstances I’ve told Hadoop explicitly who I am by
setting HADOOP_USER_NAME. Maybe that works here?

[matt@laptop ~]$ HADOOP_USER_NAME=matt.tenenbaum \
    HADOOP_CONF_DIR=soma-conf SPARK_HOME=spark-1.6.0-bin-hadoop2.6 \
    spark-1.6.0-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client
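
For reference, the VAR=value prefix used in that command sets each variable only in the environment of the one process being launched. A minimal sketch of that behaviour, using sh -c as a stand-in for spark-shell so that it runs without a cluster:

```shell
# Start from a clean slate so the comparison is meaningful.
unset HADOOP_USER_NAME

# The prefix assignment is visible to the child process only.
# sh -c is a stand-in for spark-shell here.
child_view=$(HADOOP_USER_NAME=matt.tenenbaum \
    sh -c 'echo "${HADOOP_USER_NAME:-unset}"')

# The calling shell never had the variable set.
parent_view=${HADOOP_USER_NAME:-unset}

echo "child process saw:  $child_view"
echo "calling shell sees: $parent_view"
```

To keep the setting for a whole session instead, the variable would need to be exported in the calling shell.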

Eventually that fails too, but not for the same reason. Setting
HADOOP_USER_NAME is sufficient to allow initialization to get past the
access-control problems, and I can see it request a new application from
the cluster:

16/04/01 16:43:08 INFO yarn.Client: Will allocate AM container, with
896 MB memory including 384 MB overhead
16/04/01 16:43:08 INFO yarn.Client: Setting up container launch
context for our AM
16/04/01 16:43:08 INFO yarn.Client: Setting up the launch environment
for our AM container
16/04/01 16:43:08 INFO yarn.Client: Preparing resources for our AM container
... [resource uploads happen here] ...
16/04/01 16:46:16 INFO spark.SecurityManager: Changing view acls to:
16/04/01 16:46:16 INFO spark.SecurityManager: Changing modify acls to:
16/04/01 16:46:16 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(matt, matt.tenenbaum); users with modify permissions:
Set(matt, matt.tenenbaum)
16/04/01 16:46:16 INFO yarn.Client: Submitting application 30965 to
16/04/01 16:46:16 INFO impl.YarnClientImpl: Submitted application
16/04/01 16:46:17 INFO yarn.Client: Application report for
application_1451332794331_30965 (state: ACCEPTED)