Hi Reynold/Ivan,
People familiar with pandas and R dataframes will likely have used the
dataframe "melt" idiom, which is the functionality I believe you are
referring to:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
I have had to write this function myself in my own work.
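For anyone who hasn't used it, here is a minimal pure-Python sketch of what melt does (unpivoting wide columns into long (id, variable, value) records); the function and column names are illustrative, not the pandas API:

```python
def melt(rows, id_vars, value_vars):
    """Sketch of the pandas/R 'melt' idiom: each wide input row yields
    one long output row per melted column, keeping the id columns."""
    out = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}
        for var in value_vars:
            out.append({**ids, "variable": var, "value": row[var]})
    return out
```

pandas.melt performs the same transformation over a DataFrame, with id_vars and value_vars arguments.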
I second knowing the use case for interest. I can imagine a case where
knowledge of the RDD key distribution would help local computations, for
relatively few keys, but would be interested to hear your motive.
Essentially, are you trying to achieve what would be an all-reduce type
operation in MPI?
the partitioning is even (happens when count is moved).
>
> Any pointers in figuring out this issue is much appreciated.
>
> Regards,
> Raghava.
>
>
>
>
> On Fri, Apr 22, 2016 at 7:40 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Glad to hear that th
it) at a
> later stage also.
>
> Apart from introducing a dummy stage or running it from spark-shell, is
> there any other option to fix this?
>
> Regards,
> Raghava.
>
>
> On Mon, Apr 18, 2016 at 12:17 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>> When
>
> On Mon, Apr 4, 2016 at 10:57 PM, Koert Kuipers wrote:
>
>> can you try:
>> spark.shuffle.reduceLocality.enabled=false
>>
>> On Mon, Apr 4, 2016 at 8:17 PM, Mike Hynes <91m...@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>>
If anyone else has any other ideas or experience, please let me know.
Mike
On 4/4/16, Koert Kuipers wrote:
> we ran into similar issues and it seems related to the new memory
> management. can you try:
> spark.memory.useLegacyMode = true
>
> On Mon, Apr 4, 2016 at 9:12 AM,
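For reference, the two settings suggested in this thread can also be put in conf/spark-defaults.conf instead of being set programmatically (values exactly as quoted above):

```
# conf/spark-defaults.conf
spark.shuffle.reduceLocality.enabled   false
spark.memory.useLegacyMode             true
```

Both can equally be passed with --conf on the spark-submit command line.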
[ CC'ing dev list since nearly identical questions have occurred in
user list recently w/o resolution;
c.f.:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html
http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-sing
A (
> https://issues.apache.org/jira/browse/SPARK-13109) to track this.
>
>
> On Mon, Feb 1, 2016 at 3:01 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Hi devs,
>>
>> I used to be able to do some local development from the upstream
>> master branch and run the
Hi devs,
I used to be able to do some local development from the upstream
master branch and run the publish-local command in an sbt shell to
publish the modified jars to the local ~/.ivy2 repository.
I relied on this behaviour, since I could write other local packages
that had my local 1.X.0-SNAP
Hi Alexander, Joseph, Evan,
I just wanted to weigh in an empirical result that we've had on a
standalone cluster with 16 nodes and 256 cores.
Typically we run optimization tasks with 256 partitions for 1
partition per core, and find that performance worsens with more
partitions than physical cores.
Having only 2 workers for 5 machines would be your problem: you
probably want 1 worker per physical machine, which entails running the
spark-daemon.sh script to start a worker on those machines.
The partitioning is agnostic to how many executors are available for
running the tasks, so you can't do
the
> last portion this could really make a difference.
>
> On Sat, Sep 26, 2015 at 10:20 AM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Hi Evan,
>>
>> (I just realized my initial email was a reply to the wrong thread; I'm
>> very sorry about this).
things like
> task serialization and other platform overheads. You've got to balance how
> much computation you want to do vs. the amount of time you want to spend
> waiting for the platform.
>
> - Evan
>
> On Sat, Sep 26, 2015 at 9:27 AM, Mike Hynes <91m...@gmail.com> wrote:
Hello Devs,
This email concerns some timing results for a treeAggregate in
computing a (stochastic) gradient over an RDD of labelled points, as
is currently done in the MLlib optimization routine for SGD.
In SGD, the underlying RDD is downsampled by a fraction f \in (0,1],
and the subgradients ov
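As background, the treeAggregate used here combines per-partition partial aggregates in rounds rather than reducing everything on the driver at once. A minimal pure-Python sketch of the pattern (the structure is illustrative, not Spark's actual implementation):

```python
from functools import reduce

def tree_aggregate(partitions, zero, seq_op, comb_op):
    """Sketch of the treeAggregate pattern: fold each partition down to
    a partial result with seq_op, then pairwise-combine the partials
    with comb_op in log2(n) rounds instead of one flat reduction."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    while len(partials) > 1:
        partials = [reduce(comb_op, partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0]
```

In the SGD setting, seq_op would accumulate a point's subgradient into a running gradient, and comb_op would add two partial gradients.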
Just a thought; this has worked for me before on standalone client
with a similar OOM error in a driver thread. Try setting:
export SPARK_DAEMON_MEMORY=4G #or whatever size you can afford on your machine
in your environment/spark-env.sh before running spark-submit.
Mike
On 9/2/15, ankit tyagi wrote:
ast, but
> I think it might just work as long as you stick with TorrentBroadcast.
>
> imran
>
> On Tue, Jul 28, 2015 at 10:56 PM, Mike Hynes <91m...@gmail.com> wrote:
>
>> Hi Imran,
>>
>> Thanks for your reply. I have double-checked the code I ran to
it fails at 1 << 28 with nearly the same message, but it's fine for (1 <<
> 28) - 1 with a reported block size of 2147483680. Not exactly the same as
> what you did, but I expect it to be close enough to exhibit the same error.
>
>
> On Tue, Jul 28, 2015 at 12:3
Hello Devs,
I am investigating how matrix vector multiplication can scale for an
IndexedRowMatrix in mllib.linalg.distributed.
Currently, I am broadcasting the vector to be multiplied on the right.
The IndexedRowMatrix is stored across a cluster with up to 16 nodes,
each with >200 GB of memory. T
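The row-wise scheme described here (broadcast the vector, dot it against each indexed row) can be sketched in plain Python; in Spark this would be a map over the IndexedRowMatrix rows with the vector broadcast to every partition (function name illustrative):

```python
def indexed_matvec(indexed_rows, v):
    """Each (index, row) pair yields (index, row . v). In the distributed
    setting, v is broadcast so every partition holds a local copy and no
    shuffle of the matrix rows is needed."""
    return [(i, sum(a * b for a, b in zip(row, v)))
            for i, row in indexed_rows]
```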
Gentle bump on this topic; how to test the fault tolerance and previous
benchmark results are both things we are interested in as well.
Mike
Original message From: 牛兆捷
Date: 07-09-2015 04:19 (GMT-05:00)
To: dev@spark.apache.org, u...@spark.apache.org Subject:
Questions abou
out more requests, trying to
> balance how much data needs to be buffered vs. preventing any waiting on
> remote reads (which can be controlled by spark.reducer.maxSizeInFlight).
>
> Hope that clarifies things!
>
> btw, you sent this last question to just me -- I think it's a good question
Ahhh---forgive my typo: what I mean is,
(t2 - t1) >= (t_ser + t_deser + t_exec)
is satisfied, empirically.
On 6/10/15, Mike Hynes <91m...@gmail.com> wrote:
> Hi Imran,
>
> Thank you for your email.
>
> In examining the condition (t2 - t1) < (t_ser + t_deser + t_exec), I
*waiting* for network transfer. It could
> be that there is no (measurable) wait time b/c the next blocks are fetched
> before they are needed. Shuffle writes occur in the normal task execution
> thread, though, so we (try to) measure all of it.
>
>
> On Sun, Jun 7, 2015 at 11:12 PM, Mik
ars in the Spark UI is an actual stage, so if
> you see ID's in there, but they are not in the logs, then let us know
> (that would be a bug).
>
> - Patrick
>
> On Sun, Jun 7, 2015 at 9:06 AM, Akhil Das
> wrote:
>> Are you seeing the same behavior on the driver UI? (t
Hi folks,
When I look at the output logs for an iterative Spark program, I see
that the stage IDs are not arithmetically numbered---that is, there
are gaps between stages and I might find log information about Stage
0, 1, 2, 5, but not 3 or 4.
As an example, the output from the Spark logs below sh
Hi,
This is just a thought from my experience setting up Spark to run on a
Linux cluster. I found it a bit unusual that some parameters could be
specified as command line args to spark-submit, others as env variables,
and some in a configuration file. What I ended up doing was writing my own
bash script
What does the jar command show? Are you
> sure you don't have JRE 7 but JDK 6 installed?
>
> On Tue, Feb 24, 2015 at 11:02 PM, Mike Hynes <91m...@gmail.com> wrote:
>> ./bin/compute-classpath.sh fails with error:
>>
>> $> jar -tf
>> assembly/target/scala-2.10/spar
./bin/compute-classpath.sh fails with error:
$> jar -tf
assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar
nonexistent/class/path
java.util.zip.ZipException: invalid CEN header (bad signature)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipF