Trouble with PySpark UDFs and SPARK_HOME only on EMR

2017-06-22 Thread Nick Chammas
I’m seeing a strange issue on EMR which I posted about here.

In brief, when I try to import a UDF I’ve defined, Python somehow fails to
find Spark. This exact code works for me locally and works on our
on-premises CDH cluster under YARN.

This is the traceback:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 318, in show
print(self._jdf.showString(n, 20))
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o89.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
in stage 0.0 (TID 3, ip-10-97-35-12.ec2.internal, executor 1):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last):
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
line 161, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
line 91, in read_udfs
_, udf = read_single_udf(pickleSer, infile)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
line 78, in read_single_udf
f, return_type = read_command(pickleSer, infile)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/worker.py",
line 54, in read_command
command = serializer._read_with_length(file)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
line 169, in _read_with_length
return self.loads(obj)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/serializers.py",
line 451, in loads
return pickle.loads(obj, encoding=encoding)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/person.py",
line 7, in <module>
from splinkr.util import repartition_to_size
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/splinkr/util.py",
line 34, in <module>
containsNull=False,
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
line 1872, in udf
return UserDefinedFunction(f, returnType)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
line 1830, in __init__
self._judf = self._create_judf(name)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/sql/functions.py",
line 1834, in _create_judf
sc = SparkContext.getOrCreate()
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/context.py",
line 310, in getOrCreate
SparkContext(conf=conf or SparkConf())
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/context.py",
line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/context.py",
line 259, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
  File 
"/mnt/yarn/usercache/hadoop/appcache/application_1498141399866_0005/container_1498141399866_0005_01_02/pyspark.zip/pyspark/java_gateway.py",
line 77, in launch_gateway
proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
  File "/usr/lib64/python3.5/subprocess.py", line 950, in __init__
restore_signals, start_new_session)
  File "/usr/lib64/python3.5/subprocess.py", line 1544, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: './bin/spark-submit'
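The traceback shows what is going wrong mechanically: when the executor
unpickles the UDF it imports splinkr.util, the udf() call at line 34 of that
module runs at import time, and in this version of Spark creating a
UserDefinedFunction eagerly calls SparkContext.getOrCreate(), which tries to
spawn ./bin/spark-submit on the worker. The relative path suggests SPARK_HOME
is not set inside the YARN containers on EMR. A hedged sketch of one
workaround, with made-up names standing in for the real splinkr code: build
the UDF lazily inside a function instead of at module scope, so importing the
module on an executor never touches SparkContext.

# Hypothetical sketch, not the original splinkr code.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def make_my_udf():
    # Creating the UDF inside a function (rather than at module import time)
    # means unpickling this module on an executor does not need a SparkContext.
    return udf(
        lambda xs: sorted(xs),
        ArrayType(IntegerType(), containsNull=False),
    )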

Does anyone have clues about what might be going on?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-with-PySpark-UDFs-and-SPARK-HOME-only-o

Reading ORC file - fine on 1.6; GC timeout on 2+

2017-05-05 Thread Nick Chammas
I have this ORC file that was generated by a Spark 1.6 program. It opens
fine in Spark 1.6 with 6GB of driver memory, and probably less.

However, when I try to open the same file in Spark 2.0 or 2.1, I get GC
timeout exceptions. And this is with 6, 8, and even 10GB of driver memory.

This is strange and smells like buggy behavior. How can I debug this or
work around it in Spark 2+?
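In case it helps with debugging, a minimal sketch (the path below is a
placeholder) that separates query planning from actually scanning the data,
to see whether the GC pressure shows up before any data is read or only
during the scan itself:

# Hypothetical debugging steps, not from the original report.
df = spark.read.orc("s3://some-bucket/path/to/the-file.orc")  # placeholder path

df.printSchema()                  # planning only; no data is scanned
print(df.rdd.getNumPartitions())  # how many splits Spark 2.x creates
df.limit(10).show()               # a tiny scan, as opposed to a full count()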

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-ORC-file-fine-on-1-6-GC-timeout-on-2-tp28654.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark fair scheduler pools vs. YARN queues

2017-04-05 Thread Nick Chammas
I'm having trouble understanding the difference between Spark fair
scheduler pools and YARN queues. Do they conflict? Does one override the
other?

I posted a more detailed question about an issue I'm having with this on
Stack Overflow: http://stackoverflow.com/q/43239921/877069
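For concreteness, a hedged sketch of how the two knobs are set (the queue
and pool names below are made up). As I understand it, the YARN queue
controls how cluster resources are shared *between* applications, while a
fair scheduler pool controls how concurrent jobs *within* one application
share that application's executors:

from pyspark import SparkConf, SparkContext

# Hypothetical configuration; "etl" and "high-priority" are placeholder names.
conf = (SparkConf()
        .set("spark.yarn.queue", "etl")        # YARN queue: between applications
        .set("spark.scheduler.mode", "FAIR"))  # fair scheduling within this app
sc = SparkContext(conf=conf)

# Fair scheduler pool: applies to jobs submitted from this thread.
sc.setLocalProperty("spark.scheduler.pool", "high-priority")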

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-fair-scheduler-pools-vs-YARN-queues-tp28572.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark-ec2 vs. EMR

2015-12-01 Thread Nick Chammas
Pinging this thread in case anyone has thoughts on the matter they want to
share.

On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Spark has come bundled with spark-ec2 for many years. At the same time,
> EMR has been capable of running Spark for a while, and earlier this year
> it added "official" support.
>
> If you're looking for a way to provision Spark clusters, there are some
> clear differences between these 2 options. I think the biggest one would be
> that EMR is a "production" solution backed by a company, whereas spark-ec2
> is not really intended for production use (as far as I know).
>
> That particular difference in intended use may or may not matter to you,
> but I'm curious:
>
> What are some of the other differences between the 2 that do matter to
> you? If you were considering these 2 solutions for your use case at one
> point recently, why did you choose one over the other?
>
> I'd be especially interested in hearing about why people might choose
> spark-ec2 over EMR, since the latter option seems to have shaped up nicely
> this year.
>
> Nick
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-spark-ec2-vs-EMR-tp25538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

[PSA] Use Stack Overflow!

2015-05-02 Thread Nick Chammas
This mailing list sees a lot of traffic every day.

With such a volume of mail, you may find it hard to find discussions you
are interested in, and if you are the one starting discussions you may
sometimes feel your mail is going into a black hole.

We can't change the nature of this mailing list (it is required by the
Apache Software Foundation), but there is an alternative in Stack Overflow.

Stack Overflow has an active Apache Spark tag, and Spark committers like
Sean Owen and Josh Rosen are active under that tag, along with several
contributors and of course regular users.

Try it out! I think when your question fits Stack Overflow's guidelines, it
is generally better to ask it there than on this mailing list.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PSA-Use-Stack-Overflow-tp22732.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Submit Spark applications from a machine that doesn't have Java installed

2015-01-11 Thread Nick Chammas
Is it possible to submit a Spark application to a cluster from a machine
that does not have Java installed?

My impression is that many, many more computers come with Python installed
by default than do with Java.

I want to write a command-line utility that submits a Spark
application to a remote cluster. I want that utility to run on as many
machines as possible out-of-the-box, so I want to avoid a dependency on
Java (or a JRE) if possible.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Submit-Spark-applications-from-a-machine-that-doesn-t-have-Java-installed-tp21085.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Discourse: A proposed alternative to the Spark User list

2014-12-24 Thread Nick Chammas
When people have questions about Spark, there are 2 main places (as far as
I can tell) where they ask them:

   - Stack Overflow, under the apache-spark tag
   - This mailing list

The mailing list is valuable as an independent place for discussion that is
part of the Spark project itself. Furthermore, it allows for a broader range
of discussions than would be allowed on Stack Overflow.

As the Spark project has grown in popularity, I see that a few problems
have emerged with this mailing list:

   - It’s hard to follow topics (e.g. Streaming vs. SQL) that you’re
   interested in, and it’s hard to know when someone has mentioned you
   specifically.
   - It’s hard to search for existing threads and link information across
   disparate threads.
   - It’s hard to format code and log snippets nicely, and by extension,
   hard to read other people’s posts with this kind of information.

There are existing solutions to all these (and other) problems based around
straight-up discipline or client-side tooling, which users have to conjure
up for themselves.

I’d like us as a community to consider using Discourse as an alternative
to, or overlay on top of, this mailing list, one that provides better
out-of-the-box solutions to these problems.

Discourse is a modern discussion platform built by some of the same people
who created Stack Overflow. It has many neat features that I believe this
community would benefit from.

For example:

   - When a user starts typing up a new post, they get a panel *showing
   existing conversations that look similar*, just like on Stack Overflow.
   - It’s easy to search for posts and link between them.
   - *Markdown support* is built-in to composer.
   - You can *specifically mention people* and they will be notified.
   - Posts can be categorized (e.g. Streaming, SQL, etc.).
   - There is a built-in option for mailing list support which forwards all
   activity on the forum to a user’s email address and which allows for
   creation of new posts via email.

What do you think of Discourse as an alternative, more manageable way to
discuss Spark?

There are a few options we can consider:

   1. Work with the ASF as well as the Discourse team to allow Discourse to
   act as an overlay on top of this mailing list, allowing people to continue
   to use the mailing list as-is if they want. (This is the toughest but
   perhaps most attractive option.)
   2. Create a new Discourse forum for Spark that is not bound to this user
   list. This is relatively easy but will effectively fork the community on
   this list. (We cannot shut down this mailing list in favor of one managed
   by Discourse.)
   3. Don’t use Discourse. Just encourage people on this list to post
   instead on Stack Overflow whenever possible.
   4. Something else.

What does everyone think?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Multipart uploads to Amazon S3 from Apache Spark

2014-10-13 Thread Nick Chammas
Cross posting an interesting question on Stack Overflow.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Multipart-uploads-to-Amazon-S3-from-Apache-Spark-tp16315.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Write 1 RDD to multiple output paths in one go

2014-09-13 Thread Nick Chammas
Howdy doody Spark Users,

I’d like to somehow write out a single RDD to multiple paths in one go.
Here’s an example.

I have an RDD of (key, value) pairs like this:

>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]

Now I want to write the RDD out to different paths depending on the keys,
so that I have one output directory per distinct key. Each output directory
could potentially have multiple part- files or whatever.

So my output would be something like:

/path/prefix/n [/part-1, /part-2, etc]
/path/prefix/b [/part-1, /part-2, etc]
/path/prefix/f [/part-1, /part-2, etc]

How would you do that?

I suspect I need to use saveAsNewAPIHadoopFile or saveAsHadoopFile along
with the MultipleTextOutputFormat output format class, but I’m not sure how.
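For what it's worth, here is a minimal workaround sketch that sidesteps
MultipleTextOutputFormat entirely (I don't know the idiomatic way to drive
that class from PySpark): write one filtered pass per distinct key. This is
fine for a handful of keys, but it rescans the RDD once per key, so it won't
scale to many keys.

# Hedged sketch: one output directory per key, without a custom OutputFormat.
a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
a.cache()  # avoid recomputing the source RDD for every key

for key in a.keys().distinct().collect():
    (a.filter(lambda kv, k=key: kv[0] == k)  # default arg pins the loop variable
      .values()
      .saveAsTextFile('/path/prefix/{}'.format(key.lower())))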

By the way, there is a very similar question to this here on Stack Overflow.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Write-1-RDD-to-multiple-output-paths-in-one-go-tp14174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

[PySpark] large # of partitions causes OOM

2014-08-29 Thread Nick Chammas
Here’s a repro for PySpark:

a = sc.parallelize(["Nick", "John", "Bob"])
a = a.repartition(24000)
a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)

When I try this on an EC2 cluster with 1.1.0-rc2 and Python 2.7, this is
what I get:

>>> a = sc.parallelize(["Nick", "John", "Bob"])>>> a = a.repartition(24000)>>> 
>>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)14/08/29 
>>> 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
>>> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
>>> heart beats: 175143ms exceeds 45000ms14/08/29 21:53:50 WARN 
>>> BlockManagerMasterActor: Removing BlockManager BlockManagerId(10, 
>>> ip-10-138-18-106.ec2.internal, 33711, 0) with no recent heart beats: 
>>> 175359ms exceeds 45000ms14/08/29 21:54:02 WARN BlockManagerMasterActor: 
>>> Removing BlockManager BlockManagerId(19, ip-10-139-36-207.ec2.internal, 
>>> 52208, 0) with no recent heart beats: 173061ms exceeds 45000ms14/08/29 
>>> 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
>>> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
>>> heart beats: 176816ms exceeds 45000ms14/08/29 21:54:22 WARN 
>>> BlockManagerMasterActor: Removing BlockManager BlockManagerId(7, 
>>> ip-10-236-145-200.ec2.internal, 40959, 0) with no recent heart beats: 
>>> 182241ms exceeds 45000ms14/08/29 21:54:40 WARN BlockManagerMasterActor: 
>>> Removing BlockManager BlockManagerId(4, ip-10-139-1-195.ec2.internal, 
>>> 49221, 0) with no recent heart beats: 178406ms exceeds 45000ms14/08/29 
>>> 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver thread-3
java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at 
org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at 
org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR
SendingConnection: Exception while reading SendingConnection to
ConnectionManagerId(ip-10-73-142-223.ec2.internal,54014)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at 
org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:199)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at 
org.apache.spark.serializer.KryoSerializerInstance.dese

Spark Screencast doesn't show in Chrome on OS X

2014-08-25 Thread Nick Chammas
https://spark.apache.org/screencasts/1-first-steps-with-spark.html

The embedded YouTube video shows up in Safari on OS X but not in Chrome.

How come?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Screencast-doesn-t-show-in-Chrome-on-OS-X-tp12782.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How do you debug a PythonException?

2014-07-29 Thread Nick Chammas
I’m in the PySpark shell and I’m trying to do this:

a = sc.textFile('s3n://path-to-handful-of-very-large-files-totalling-1tb/*.json',
                minPartitions=sc.defaultParallelism * 3).cache()
a.map(lambda x: len(x)).max()

My job dies with the following:

14/07/30 01:46:28 WARN TaskSetManager: Loss was due to
org.apache.spark.api.python.PythonException
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 73, in main
command = pickleSer._read_with_length(infile)
  File "/root/spark/python/pyspark/serializers.py", line 142, in
_read_with_length
length = read_int(stream)
  File "/root/spark/python/pyspark/serializers.py", line 337, in read_int
raise EOFError
EOFError

at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:115)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:145)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:78)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/07/30 01:46:29 ERROR TaskSchedulerImpl: Lost executor 19 on
ip-10-190-171-217.ec2.internal: remote Akka client disassociated

How do I debug this? I’m using 1.0.2-rc1 deployed to EC2.
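The EOFError in read_int means the Python worker hit end-of-stream while
reading its task from the executor, which usually means the executor JVM on
that side went away mid-task; the "remote Akka client disassociated" line
says the same thing (an executor was lost, often due to memory pressure). A
hedged narrowing-down sketch, with placeholder file names, that runs the same
computation one input file at a time to see whether a particular file or just
the overall volume triggers the failure:

# Hypothetical sketch; the file listing below is a placeholder.
paths = [
    's3n://path-to-handful-of-very-large-files-totalling-1tb/file-1.json',
    's3n://path-to-handful-of-very-large-files-totalling-1tb/file-2.json',
]

for p in paths:
    try:
        m = (sc.textFile(p, minPartitions=sc.defaultParallelism * 3)
               .map(lambda x: len(x)).max())
        print('{}: {}'.format(p, m))
    except Exception as e:  # broad on purpose: we only want the failing file
        print('{}: FAILED: {}'.format(p, e))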

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-debug-a-PythonException-tp10906.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Include permalinks in mail footer

2014-07-17 Thread Nick Chammas
Can we modify the mailing list to include permalinks to the thread in the
footer of every email? Or at least of the initial email in a thread?

I often find myself wanting to reference one thread from another, or from a
JIRA issue. Right now I have to google the thread subject and find the link
that way.

It would be nice to be able to find the permalink I need from the thread
itself.

It might also be helpful for people to include an unsubscribe link in the
footer. That is a common practice in most mailing lists.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Include-permalinks-in-mail-footer-tp10075.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark SQL throws ClassCastException on first try; works on second

2014-07-14 Thread Nick Chammas
I’m running this query against RDD[Tweet], where Tweet is a simple case
class with 4 fields.

sqlContext.sql("""
  SELECT user, COUNT(*) as num_tweets
  FROM tweets
  GROUP BY user
  ORDER BY
num_tweets DESC,
user ASC
  ;
""").take(5)

The first time I run this, it throws the following:

14/07/15 06:11:51 ERROR TaskSetManager: Task 12.0:0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 12.0:0 failed 4 times, most recent failure: Exception failure in
TID 978 on host ip-10-144-204-254.ec2.internal:
java.lang.ClassCastException: java.lang.Long cannot be cast to
java.lang.String
scala.math.Ordering$String$.compare(Ordering.scala:329)

org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(Row.scala:227)

org.apache.spark.sql.catalyst.expressions.RowOrdering.compare(Row.scala:210)
java.util.TimSort.mergeLo(TimSort.java:687)
java.util.TimSort.mergeAt(TimSort.java:483)
java.util.TimSort.mergeCollapse(TimSort.java:410)
java.util.TimSort.sort(TimSort.java:214)
java.util.TimSort.sort(TimSort.java:173)
java.util.Arrays.sort(Arrays.java:659)
scala.collection.SeqLike$class.sorted(SeqLike.scala:615)
scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:108)

org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:154)

org.apache.spark.sql.execution.Sort$$anonfun$execute$3$$anonfun$apply$4.apply(basicOperators.scala:154)
org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
org.apache.spark.rdd.RDD$$anonfun$12.apply(RDD.scala:559)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:110)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

If I immediately re-run the query, it works fine. I’ve been able to
reproduce this a few times. If I run other, simpler SELECT queries first
and then this one, it also gets around the problem. Strange…

I’m on 1.0.0 on EC2.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-throws-ClassCastException-on-first-try-works-on-second-tp9720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Supported SQL syntax in Spark SQL

2014-07-12 Thread Nick Chammas
Is there a place where we can find an up-to-date list of supported SQL
syntax in Spark SQL?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Supported-SQL-syntax-in-Spark-SQL-tp9538.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Stopping StreamingContext does not kill receiver

2014-07-12 Thread Nick Chammas
From the interactive shell I’ve created a StreamingContext.

I call ssc.start() and take a look at http://master_url:4040/streaming/ and
see that I have an active Twitter receiver. Then I call
ssc.stop(stopSparkContext = false, stopGracefully = true) and wait a bit,
but the receiver seems to stay active.

Is this expected? I’m running 1.0.1 on EC2.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Stopping-StreamingContext-does-not-kill-receiver-tp9522.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Nick Chammas
Looks like twitter4j 2.2.6 is what works,
but I don’t believe it’s documented anywhere.

Using 3.0.6 works for a while, but then causes the following error:

14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for
stream 0: Error in block pushing thread -
java.io.NotSerializableException: twitter4j.internal.json.ScopesImpl

4.0.2 doesn’t work at all and complains about not finding a StatusListener
class.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-version-of-twitter4j-should-I-use-with-Spark-Streaming-tp9352.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to RDD.take(middle 10 elements)

2014-07-10 Thread Nick Chammas
Interesting question on Stack Overflow:
http://stackoverflow.com/q/24677180/877069

Basically, is there a way to take() elements of an RDD at an arbitrary
index?
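One common answer (offered here as a hedged sketch, not necessarily what the
Stack Overflow thread settled on) is to number the elements with
zipWithIndex() and then filter on the index range you want:

# Take the "middle 10" elements of a 100-element RDD.
rdd = sc.parallelize(range(100))

middle_10 = (rdd.zipWithIndex()                       # -> (element, index)
                .filter(lambda kv: 45 <= kv[1] < 55)  # keep indices 45..54
                .map(lambda kv: kv[0])
                .collect())
print(middle_10)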

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-RDD-take-middle-10-elements-tp9340.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Restarting a Streaming Context

2014-07-09 Thread Nick Chammas
So I do this from the Spark shell:

// set things up

ssc.start()
// let things happen for a few minutes

ssc.stop(stopSparkContext = false, stopGracefully = true)

Then I want to restart the Streaming Context:

ssc.start() // still in the shell; Spark Context is still alive

Which yields:

org.apache.spark.SparkException: StreamingContext has already been stopped

How come? Is there any way in the interactive shell to restart a Streaming
Context once it is stopped?
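From what I can tell, a stopped StreamingContext cannot be started again;
the usual workaround is to construct a brand-new StreamingContext on top of
the still-running SparkContext and re-declare the streams. A hedged sketch in
PySpark terms (the session above is the Scala shell, and this assumes a Spark
version that has the Python streaming API):

from pyspark.streaming import StreamingContext

# Build a fresh context on the existing SparkContext; the DStreams and output
# operations have to be re-created before starting it.
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches
# ... define input DStreams and output operations here ...
ssc.start()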

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Restarting-a-Streaming-Context-tp9256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like
to add in a jar I believe I need to access Twitter data live, twitter4j.
I’m confused over where and how to add this jar in.

SPARK-1089 mentions two environment variables, SPARK_CLASSPATH and
ADD_JARS. SparkContext also has an addJar method and a jars property, the
latter of which does not have an associated doc.

What’s the difference between all these jar-related things, and what do I
need to do to add this Twitter jar in correctly?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-should-I-add-a-jar-tp9224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

wholeTextFiles and gzip

2014-06-25 Thread Nick Chammas
Interesting question on Stack Overflow:
http://stackoverflow.com/questions/24402737/how-to-read-gz-files-in-spark-using-wholetextfiles

Is it possible to read gzipped files using wholeTextFiles()? Alternately,
is it possible to read the source file names using textFile()?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-and-gzip-tp8283.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Patterns for making multiple aggregations in one pass

2014-06-18 Thread Nick Chammas
The following is a simplified example of what I am trying to accomplish.

Say I have an RDD of objects like this:

{"country": "USA", "name": "Franklin", "age": 24, "hits": 224}
{"country": "USA", "name": "Bob", "age": 55, "hits": 108}
{"country": "France", "name": "Remi", "age": 33, "hits": 72}

I want to find the average age and total number of hits per country.
Ideally, I would like to scan the data once and perform both aggregations
simultaneously.

What is a good approach to doing this?

I’m thinking that we’d want to keyBy(country), and then somehow
reduceByKey(). The problem is, I don’t know how to approach writing a
function that can be passed to reduceByKey() and that will track a running
average and total simultaneously.
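A hedged sketch of one common pattern (not from this thread): carry partial
sums and a count through reduceByKey(), then finish the average at the end
with mapValues().

# Each record becomes (country, (sum_of_ages, sum_of_hits, count)).
records = sc.parallelize([
    {"country": "USA",    "name": "Franklin", "age": 24, "hits": 224},
    {"country": "USA",    "name": "Bob",      "age": 55, "hits": 108},
    {"country": "France", "name": "Remi",     "age": 33, "hits": 72},
])

per_country = (records
    .map(lambda r: (r["country"], (r["age"], r["hits"], 1)))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    .mapValues(lambda t: {"avg_age": t[0] / float(t[2]), "total_hits": t[1]}))

print(per_country.collect())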

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-tp7874.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark is now available via Homebrew

2014-06-18 Thread Nick Chammas
OS X / Homebrew users,

It looks like you can now download Spark simply by doing:

brew install apache-spark

I’m new to Homebrew, so I’m not too sure how people are intended to use
this. I’m guessing this would just be a convenient way to get the latest
release onto your workstation, and from there use spark-ec2 to launch
clusters.

Anyway, just a cool thing to point out.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-is-now-available-via-Homebrew-tp7856.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Using Spark to crack passwords

2014-06-11 Thread Nick Chammas
Spark is obviously well-suited to crunching massive amounts of data. How
about to crunch massive amounts of numbers?

A few years ago I put together a little demo for some co-workers to
demonstrate the dangers of using SHA1 to hash and store passwords. Part of
the demo included a live brute-forcing of hashes to show
how SHA1's speed made it unsuitable for hashing passwords.

I think it would be cool to redo the demo, but utilize the power of a
cluster managed by Spark to crunch through hashes even faster.

But how would you do that with Spark (if at all)?

I'm guessing you would create an RDD that somehow defined the search space
you're going to go through, and then partition it to divide the work up
equally amongst the cluster's cores. Does that sound right?
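To make the idea concrete, a hedged sketch of that approach (the wordlist
path and the target hash below are placeholders): the candidate space is an
RDD, each candidate is hashed on the executors, and any matches are collected
back to the driver.

import hashlib

# Hypothetical sketch of the approach described above, not a real tool.
target = hashlib.sha1(b"hunter2").hexdigest()  # placeholder target hash

candidates = sc.textFile("s3n://some-bucket/wordlist.txt",  # placeholder path
                         minPartitions=sc.defaultParallelism * 4)

matches = (candidates
    .filter(lambda w: hashlib.sha1(w.encode("utf-8")).hexdigest() == target)
    .collect())
print(matches)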

I wonder if others have already used Spark for computationally-intensive
workloads like this, as opposed to just data-intensive ones.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-crack-passwords-tp7437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

stage kill link is awfully close to the stage name

2014-06-06 Thread Nick Chammas
Minor point, but does anyone else find the new (and super helpful!) "kill"
link awfully close to the stage detail link in the 1.0.0 Web UI?

I think it would be better to have the kill link flush right, leaving a
large amount of space between it and the stage detail link.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stage-kill-link-is-awfully-close-to-the-stage-name-tp7153.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Showing key cluster stats in the Web UI

2014-06-06 Thread Nick Chammas
Someone correct me if this is wrong, but I believe 2 very important things
to know about your cluster are:

   1. How many cores your cluster has available.
   2. How much memory your cluster has available. (Perhaps this could
   be divided into total/in-use/free or something.)

Is this information easily available on the Web UI? Would it make sense to
add it in there in the environment overview page?

Continuing on that note, is it not also important to know what level of
parallelism your stages are running at? As in, how many concurrent tasks
are running for a given stage? If that is much lower than the number of
cores you have available, for example, that may be something obvious to
look into.

If so, showing the number of tasks running concurrently would be another
useful thing to add to the UI for the Stage detail page.

Does this make sense?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Showing-key-cluster-stats-in-the-Web-UI-tp7150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Subscribing to news releases

2014-05-30 Thread Nick Chammas
Is there a way to subscribe to news releases? That would be swell.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Subscribing-to-news-releases-tp6592.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Why Scala?

2014-05-29 Thread Nick Chammas
I recently discovered Hacker News and started reading through older posts
about Scala. It
looks like the language is fairly controversial on there, and it got me
thinking.

Scala appears to be the preferred language to work with in Spark, and Spark
itself is written in Scala, right?

I know that often times a successful project evolves gradually out of
something small, and that the choice of programming language may not always
have been made consciously at the outset.

But pretending that it was, why is Scala the preferred language of Spark?

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Scala-tp6536.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to Unsubscribe from the Spark user list

2014-05-20 Thread Nick Chammas
Send an email to this address to unsubscribe from the Spark user list:

user-unsubscr...@spark.apache.org

Sending an email to the Spark user list itself (i.e. this list) *does not
do anything*, even if you put "unsubscribe" as the subject. We will all
just see your email.

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Unsubscribe-from-the-Spark-user-list-tp6150.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Using Spark to analyze complex JSON

2014-05-20 Thread Nick Chammas
The Apache Drill home page has an
interesting heading: "Liberate Nested Data".

Is there any current or planned functionality in Spark SQL or Shark to
enable SQL-like querying of complex JSON?
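For reference, later versions of Spark SQL do handle this directly: nested
JSON can be loaded into a DataFrame with the schema inferred automatically,
and nested fields can be queried with dot notation (arrays can be unpacked
with explode). A hedged sketch; the file and field names below are made up:

# Hypothetical example; "tweets.json" and its fields are placeholders.
df = spark.read.json("tweets.json")
df.printSchema()  # shows the inferred nested schema

df.createOrReplaceTempView("tweets")
spark.sql("SELECT user.name, user.followers_count FROM tweets").show()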

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-to-analyze-complex-JSON-tp6146.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-16 Thread Nick Chammas
I’m trying to do a simple count() on a large number of GZipped files in S3.
My job is failing with the following message:

14/05/15 19:12:37 WARN scheduler.TaskSetManager: Loss was due to
java.io.IOException
java.io.IOException: incorrect header check
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native
Method)
at 
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
at java.io.InputStream.read(InputStream.java:101)



I traced this down to HADOOP-5281, but I’m not sure 1) whether it’s the same
issue, or 2) how to go about resolving it.

I gather I need to update some Hadoop jar? Any tips on where to look/what
to do?

I’m running Spark on an EC2 cluster created by spark-ec2 with no special
options used.
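One thing that may be worth ruling out (offered as a hedged sketch): an
"incorrect header check" from zlib usually means the bytes being decompressed
are not valid gzip/zlib data for the codec Hadoop picked, so it can help to
check the gzip magic bytes (1f 8b) of one of the S3 objects directly after
downloading it. The file name below is a placeholder.

# Hypothetical local check on one downloaded object.
with open("one-object.gz", "rb") as f:
    magic = f.read(2)
print(magic == b"\x1f\x8b")  # True for real gzip data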

Nick




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/count-ing-gz-files-gives-java-io-IOException-incorrect-header-check-tp5768.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.