Re: Spark DataFrame UNPIVOT feature

2018-08-22 Thread Mike Hynes
Hi Reynold/Ivan, People familiar with pandas and R dataframes will likely have used the dataframe "melt" idiom, which is the functionality I believe you are referring to: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html I have had to write this function myself in my own wor
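For reference, a minimal sketch of the melt idiom in current Spark, assuming a spark-shell session (so `spark` is in scope) and illustrative column names, pending a dedicated UNPIVOT/melt API:

    // Hedged sketch: unpivot wide columns into (variable, value) rows via the
    // SQL stack() generator -- the closest built-in analogue to pandas melt.
    import spark.implicits._
    val df = Seq((1, 10.0, 20.0), (2, 30.0, 40.0)).toDF("id", "colA", "colB")
    val melted = df.selectExpr(
      "id",
      "stack(2, 'colA', colA, 'colB', colB) as (variable, value)")
    melted.show()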

Re: RDD.broadcast

2016-04-28 Thread Mike Hynes
I'd second the interest in knowing the use case. I can imagine a case where knowledge of the RDD key distribution would help local computations, for relatively few keys, but would be interested to hear your motive. Essentially, are you trying to achieve what would be an all-reduce type operation in MPI
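A hedged sketch of the kind of thing I have in mind (spark-shell, `sc` in scope; names and data are illustrative): collect the small key histogram on the driver and broadcast it, so every task can consult the global distribution without a shuffle.

    val pairs = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
    val keyCounts = pairs.countByKey()        // small Map[String, Long] on the driver
    val bcCounts = sc.broadcast(keyCounts)    // shipped once to every executor
    val normalized = pairs.map { case (k, v) => (k, v / bcCounts.value(k)) }
    normalized.collect()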

Re: executor delay in Spark

2016-04-24 Thread Mike Hynes
the partitioning is even (happens when count is moved). > > Any pointers in figuring out this issue is much appreciated. > > Regards, > Raghava. > > > > > On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote: > >> Glad to hear that th

Re: executor delay in Spark

2016-04-22 Thread Mike Hynes
it) at a > later stage also. > > Apart from introducing a dummy stage or running it from spark-shell, is > there any other option to fix this? > > Regards, > Raghava. > > > On Mon, Apr 18, 2016 at 12:17 AM, Mike Hynes <91m...@gmail.com> wrote: > >> When
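For what it's worth, a sketch of the dummy-stage workaround referred to above (driver-side code, `sc` in scope; the executor count is illustrative):

    // Run a trivial job so executors have a chance to register with the driver
    // before the real data is partitioned and cached.
    sc.parallelize(1 to sc.defaultParallelism).count()
    // Or poll until the expected executors have registered
    // (getExecutorMemoryStatus also includes the driver, hence the +1).
    val expectedExecutors = 8                  // illustrative
    while (sc.getExecutorMemoryStatus.size < expectedExecutors + 1) {
      Thread.sleep(500)
    }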

Re: RDD Partitions not distributed evenly to executors

2016-04-06 Thread Mike Hynes
> > On Mon, Apr 4, 2016 at 10:57 PM, Koert Kuipers wrote: > >> can you try: >> spark.shuffle.reduceLocality.enabled=false >> >> On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote: >> >>> Dear all, >>> >>>

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
If anyone else has any other ideas or experience, please let me know. Mike On 4/4/16, Koert Kuipers wrote: > we ran into similar issues and it seems related to the new memory > management. can you try: > spark.memory.useLegacyMode = true > > On Mon, Apr 4, 2016 at 9:12 AM,
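For anyone following along, a sketch of how the two workarounds suggested in this thread would be set programmatically (they can equally be passed as --conf flags to spark-submit); these are workarounds under discussion, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("partition-balance-test")
      .set("spark.shuffle.reduceLocality.enabled", "false") // first suggestion above
      .set("spark.memory.useLegacyMode", "true")            // fall back to pre-1.6 memory management
    val sc = new SparkContext(conf)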

RDD Partitions not distributed evenly to executors

2016-04-04 Thread Mike Hynes
[ CC'ing dev list since nearly identical questions have occurred in user list recently w/o resolution; cf.: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-sing

Re: sbt publish-local fails with 2.0.0-SNAPSHOT

2016-02-01 Thread Mike Hynes
A (https://issues.apache.org/jira/browse/SPARK-13109) to track this. > > On Mon, Feb 1, 2016 at 3:01 PM, Mike Hynes <91m...@gmail.com> wrote: > >> Hi devs, >> >> I used to be able to do some local development from the upstream >> master branch and run the

sbt publish-local fails with 2.0.0-SNAPSHOT

2016-01-31 Thread Mike Hynes
Hi devs, I used to be able to do some local development from the upstream master branch and run the publish-local command in an sbt shell to publish the modified jars to the local ~/.ivy2 repository. I relied on this behaviour, since I could write other local packages that had my local 1.X.0-SNAP
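In sbt terms, the workflow is roughly: run publish-local (publishLocal in newer sbt) from the Spark source tree, then point a downstream build.sbt at the locally published snapshot. A minimal sketch, with the Scala version and dependency string as illustrative assumptions:

    // build.sbt of a downstream project; assumes `sbt publish-local` was run in
    // the Spark source tree so the jars land in ~/.ivy2/local.
    scalaVersion := "2.11.8"
    resolvers += Resolver.defaultLocal
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0-SNAPSHOT"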

Re: Gradient Descent with large model size

2015-10-19 Thread Mike Hynes
Hi Alexander, Joseph, Evan, I just wanted to weigh in with an empirical result that we've had on a standalone cluster with 16 nodes and 256 cores. Typically we run optimization tasks with 256 partitions for 1 partition per core, and find that performance worsens with more partitions than physical core
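Concretely, the setup referred to above looks something like this (spark-shell, `sc` in scope; the input path is illustrative):

    // One partition per physical core: 16 nodes x 16 cores in our case.
    val numCores = 16 * 16
    val data = sc.textFile("hdfs:///path/to/training/data")   // illustrative path
    val points = data.repartition(numCores).cache()
    points.count()   // materialize with the intended partitioning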

Re: No speedup in MultiLayerPerceptronClassifier with increase in number of cores

2015-10-11 Thread Mike Hynes
Having only 2 workers for 5 machines would be your problem: you probably want 1 worker per physical machine, which entails running the spark-daemon.sh script to start a worker on those machines. The partitioning is agnostic to how many executors are available for running the tasks, so you can't do

Re: treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
the > last portion this could really make a difference. > > On Sat, Sep 26, 2015 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote: > >> Hi Evan, >> >> (I just realized my initial email was a reply to the wrong thread; I'm >> very sorry about this).

treeAggregate timing / SGD performance with miniBatchFraction < 1

2015-09-26 Thread Mike Hynes
things like > task serialization and other platform overheads. You've got to balance how > much computation you want to do vs. the amount of time you want to spend > waiting for the platform. > > - Evan > > On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com>

Re: RDD API patterns

2015-09-26 Thread Mike Hynes
Hello Devs, This email concerns some timing results for a treeAggregate in computing a (stochastic) gradient over an RDD of labelled points, as is currently done in the MLlib optimization routine for SGD. In SGD, the underlying RDD is downsampled by a fraction f \in (0,1], and the subgradients ov
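For concreteness, a hedged sketch of the pattern being timed (spark-shell, `sc` in scope): downsample by a fraction f and sum a (sub)gradient with treeAggregate. The least-squares gradient and the synthetic data below are stand-ins, not MLlib's exact code.

    val n = 10                                     // feature dimension (illustrative)
    val w = Array.fill(n)(0.0)                     // current weights
    val points = sc.parallelize(1 to 100000, 256).map { i =>
      val x = Array.fill(n)(scala.util.Random.nextDouble())
      (x.sum, x)                                   // (label, features), synthetic
    }
    val f = 0.1                                    // miniBatchFraction
    val (gradSum, count) = points.sample(false, f, 42L)
      .treeAggregate((Array.fill(n)(0.0), 0L))(
        seqOp = (acc, point) => {
          val (g, c) = acc
          val (y, x) = point
          val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
          var i = 0
          while (i < n) { g(i) += err * x(i); i += 1 }   // accumulate gradient in place
          (g, c + 1)
        },
        combOp = (a, b) => {
          var i = 0
          while (i < n) { a._1(i) += b._1(i); i += 1 }
          (a._1, a._2 + b._2)
        },
        depth = 2)
    val gradient = gradSum.map(_ / math.max(count, 1L))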

Re: OOM in spark driver

2015-09-02 Thread Mike Hynes
Just a thought; this has worked for me before on a standalone client with a similar OOM error in a driver thread. Try setting export SPARK_DAEMON_MEMORY=4G (or whatever size you can afford on your machine) in your environment/spark-env.sh before running spark-submit. Mike On 9/2/15, ankit tyagi wro

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-29 Thread Mike Hynes
ast, but > I think it might just work as long as you stick with TorrentBroadcast. > > imran > > On Tue, Jul 28, 2015 at 10:56 PM, Mike Hynes <91m...@gmail.com> wrote: > >> Hi Imran, >> >> Thanks for your reply. I have double-checked the code I ran to

Re: Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
it fails at 1 << 28 with nearly the same message, but it's fine for (1 << 28) - 1 with a reported block size of 2147483680. Not exactly the same as > what you did, but I expect it to be close enough to exhibit the same error. > > > On Tue, Jul 28, 2015 at 12:3
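Incidentally, a hedged reading of the thresholds quoted above, assuming 8-byte elements: one plausible culprit for a "negative memory" figure is an Int overflow in a size computation (Scala REPL arithmetic, not a claim about the exact code path).

    ((1 << 28) - 1).toLong * 8   // 2147483640L -- still below Int.MaxValue (2147483647)
    (1 << 28).toLong * 8         // 2147483648L -- one past Int.MaxValue
    (1 << 28) * 8                // -2147483648 -- wraps negative when computed with Ints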

Broadcast variable of size 1 GB fails with negative memory exception

2015-07-28 Thread Mike Hynes
Hello Devs, I am investigating how matrix-vector multiplication can scale for an IndexedRowMatrix in mllib.linalg.distributed. Currently, I am broadcasting the vector to be multiplied on the right. The IndexedRowMatrix is stored across a cluster with up to 16 nodes, each with >200 GB of memory. T
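A hedged, scaled-down sketch of the multiplication pattern described (spark-shell, `sc` in scope); dimensions are tiny and illustrative, whereas the thread concerns a ~1 GB broadcast vector:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    val n = 1000
    val rows = sc.parallelize(0L until 100L).map { i =>
      IndexedRow(i, Vectors.dense(Array.fill(n)(scala.util.Random.nextDouble())))
    }
    val mat = new IndexedRowMatrix(rows)
    val bv = sc.broadcast(Array.fill(n)(1.0))   // right-hand vector, broadcast once
    val product = mat.rows.map { row =>          // one dot product per indexed row
      val x = row.vector.toArray
      var dot = 0.0
      var j = 0
      while (j < n) { dot += x(j) * bv.value(j); j += 1 }
      (row.index, dot)
    }
    product.take(5)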

Re: Questions about Fault tolerance of Spark

2015-07-10 Thread Mike Hynes
Gentle bump on this topic; how to test fault tolerance, and any previous benchmark results, are both things we are interested in as well. Mike Original message From: 牛兆捷 Date: 07-09-2015 04:19 (GMT-05:00) To: dev@spark.apache.org, u...@spark.apache.org Subject: Questions abou

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-10 Thread Mike Hynes
out more requests, trying to > balance how much data needs to be buffered vs. preventing any waiting on > remote reads (which can be controlled by spark.reducer.maxSizeInFlight). > > Hope that clarifies things! > > btw, you sent this last question to just me -- I think it's a good question

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-09 Thread Mike Hynes
Ahhh---forgive my typo: what I mean is, (t2 - t1) >= (t_ser + t_deser + t_exec) is satisfied, empirically. On 6/10/15, Mike Hynes <91m...@gmail.com> wrote: > Hi Imran, > > Thank you for your email. > > In examining the condition (t2 - t1) < (t_ser + t_deser + t_exec), I

Re: Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-09 Thread Mike Hynes
*waiting* for network transfer. It could > be that there is no (measurable) wait time b/c the next blocks are fetched > before they are needed. Shuffle writes occur in the normal task execution > thread, though, so we (try to) measure all of it. > > > On Sun, Jun 7, 2015 at 11:12 PM, Mik

Stages with non-arithmetic numbering & Timing metrics in event logs

2015-06-07 Thread Mike Hynes
ars in the Spark UI is an actual stage, so if > you see ID's in there, but they are not in the logs, then let us know > (that would be a bug). > > - Patrick > > On Sun, Jun 7, 2015 at 9:06 AM, Akhil Das > wrote: >> Are you seeing the same behavior on the driver UI? (t

Scheduler question: stages with non-arithmetic numbering

2015-06-05 Thread Mike Hynes
Hi folks, When I look at the output logs for an iterative Spark program, I see that the stage IDs are not arithmetically numbered---that is, there are gaps between stages, and I might find log information about Stages 0, 1, 2, and 5, but not 3 or 4. As an example, the output from the Spark logs below sh
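For a concrete reproduction of such gaps, something like the following (spark-shell, `sc` in scope) should show stage IDs being allocated for shuffle map stages that are later skipped because their shuffle output already exists; that is one way IDs can go missing from the task logs (my reading, not a definitive explanation).

    val grouped = sc.parallelize(1 to 1000, 8)
      .map(x => (x % 10, x))
      .reduceByKey(_ + _)
    grouped.count()     // runs the shuffle map stage and a result stage
    grouped.collect()   // reuses the shuffle output; the map stage shows as skipped,
                        // but its stage ID was still allocated by the scheduler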

Re: Spark config option 'expression language' feedback request

2015-03-31 Thread Mike Hynes
Hi, This is just a thought from my experience setting up Spark to run on a linux cluster. I found it a bit unusual that some parameters could be specified as command line args to spark-submit, others as env variables, and some in a configuration file. What I ended up doing was writing my own bash s

Re: [ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
ar command show? are you > sure you don't have JRE 7 but JDK 6 installed? > > On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes <91m...@gmail.com> wrote: >> ./bin/compute-classpath.sh fails with error: >> >> $> jar -tf >> assembly/target/scala-2.10/spar

[ERROR] bin/compute-classpath.sh: fails with false positive test for java 1.7 vs 1.6

2015-02-24 Thread Mike Hynes
./bin/compute-classpath.sh fails with error:
$> jar -tf assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar nonexistent/class/path
java.util.zip.ZipException: invalid CEN header (bad signature)
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipF