Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-159197234
We have implemented a faster approach using zipPartitions, though the final
results are packaged in an RDD. When data volumes are huge, it is much faster
than the current implementation.
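The zipPartitions idea mentioned above can be sketched with plain Scala collections standing in for RDD partitions (the partition layout and data here are illustrative, not the commenter's actual code): when two datasets are co-partitioned, each pair of partitions can be joined locally with no shuffle.

```scala
// Two co-partitioned datasets: partition i of partsA lines up with
// partition i of partsB, so keys never cross partition boundaries.
val partsA: Seq[Map[Int, String]] =
  Seq(Map(0 -> "a0", 2 -> "a2"), Map(1 -> "a1", 3 -> "a3"))
val partsB: Seq[Map[Int, String]] =
  Seq(Map(0 -> "b0", 2 -> "b2"), Map(1 -> "b1", 3 -> "b3"))

// Analogue of rddA.zipPartitions(rddB) { (iterA, iterB) => joinLocally }:
// each partition pair is joined independently, with no data movement.
val joined: Seq[Map[Int, (String, String)]] =
  partsA.zip(partsB).map { case (pa, pb) =>
    pa.flatMap { case (k, va) => pb.get(k).map(vb => k -> (va, vb)) }
  }

assert(joined(0) == Map(0 -> ("a0", "b0"), 2 -> ("a2", "b2")))
assert(joined(1) == Map(1 -> ("a1", "b1"), 3 -> ("a3", "b3")))
```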
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-158851260
@josephlijia this feature has moved into a Spark package. If you want to
file an issue report, it's best to do it here:
Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-158647538
When we looked up a certain key-value pair with IndexedRDD, we found that it
was even slower than an ordinary RDD. We used 100, keys in our experiment.
When we tested it
Github user tispratik commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-148489954
This is very interesting. Thanks for working on it. Hopefully it will be
out soon.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user zerosign commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-135373104
Hi Ankur,
Any update on this pull request?
Github user swethakasireddi commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-121435467
Hi Ankur,
Is this available in Spark 1.4.0? Also, can this be used in Spark
Streaming for lookups/updates/deletes based on key instead of having to
Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-115716387
When I update one value by one key using IndexedRDD, it only
re-creates one LeafNode. That is the cost of the update. Is that right?
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-115831041
@josephlijia For the old version of IndexedRDD (version 0.1), an update
recreates one LeafNode, plus all InternalNodes up to the root.
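The path-copying behavior ankurdave describes can be illustrated with a miniature persistent tree in plain Scala (an illustrative sketch, not the actual IndexedRDD 0.1 data structure): an update copies the leaf plus the internal nodes on the root-to-leaf path, while every untouched subtree is shared by reference.

```scala
// Minimal persistent binary trie over small Int keys (hypothetical names).
sealed trait Node
final case class Leaf(key: Int, value: String) extends Node
final case class Internal(left: Node, right: Node) extends Node

// Update a key, rebuilding only the nodes on the path from the root to the
// leaf; all subtrees off that path are reused as-is.
def update(node: Node, key: Int, value: String, bit: Int): Node = node match {
  case Leaf(k, _) => Leaf(k, value)
  case Internal(l, r) =>
    if (((key >> bit) & 1) == 0) Internal(update(l, key, value, bit - 1), r)
    else Internal(l, update(r, key, value, bit - 1))
}

// A depth-2 trie holding keys 0..3.
val leaves = (0 to 3).map(k => Leaf(k, "v" + k))
val root = Internal(Internal(leaves(0), leaves(1)), Internal(leaves(2), leaves(3)))

val updated = update(root, 0, "NEW", 1).asInstanceOf[Internal]
assert(updated.right eq root.right)  // untouched subtree shared by reference
assert(updated.left ne root.left)    // nodes on the update path were copied
```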
Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-114842646
I ran into a question while doing some testing with IndexedRDD. I compared
the original RDD with IndexedRDD for lookups, updates, joins, and deletes.
However, I
Github user adamnovak commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-114941993
I'm not sure this is the appropriate place to ask. Maybe make a new issue
on the IndexedRDD repo?
On Wed, Jun 24, 2015 at 4:52 AM, josephlijia
Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-110605232
Well, I found that getting is slower than putting when using IndexedRDD. But
getting should be faster than putting, right? I am expecting your reply.
Thanks a
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-110610103
@josephlijia Right, getting should generally be faster than putting.
However, for large batches of keys, multiget might be slower than multiput
because it currently
Github user josephlijia commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-110615752
Look at the code below:
def multiget(ks: Array[Id]): Map[Id, V] = {
  val ksByPartition = ks.groupBy(k => self.partitioner.get.getPartition(k))
  val
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-110616280
It does send all keys to all partitions, because `ksByPartition` is
referenced in the closure passed to `context.runJob` and so is shipped in full
to all partitions.
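The problem described above can be reproduced in miniature without Spark (the getPartition function and key range here are made up for illustration): grouping keys by partition yields a single map, and a closure that captures that whole map ships every partition's keys to every task, even though each task only needs its own slice.

```scala
// Stand-in for self.partitioner.get.getPartition(k).
val numPartitions = 4
def getPartition(k: Long): Int = (k % numPartitions).toInt

val ks: Array[Long] = (0L until 100L).toArray
val ksByPartition: Map[Int, Array[Long]] = ks.groupBy(getPartition)

// A task for partition p only needs ksByPartition(p), but a closure that
// references `ksByPartition` itself is serialized with the full map, so
// every partition's keys travel to every task.
val keysShippedIfClosureCapturesMap = ksByPartition.values.map(_.length).sum
val keysActuallyNeededByOneTask = ksByPartition(0).length

assert(keysShippedIfClosureCapturesMap == 100)  // all keys, to each task
assert(keysActuallyNeededByOneTask == 25)       // one task's fair share
```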
Github user jason-dai commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-79776523
@jegonzal I wonder if you can share more details on your stack overflow
issue. We were considering a general fix (e.g., as I outlined in
Github user jegonzal commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69504481
We should really address this stack overflow issue. Is there a JIRA we can
promote?
Github user octavian-ganea commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69505333
Writing the RDD to disk from time to time is not a solution for me. Also,
the second idea is not good if I am doing random put and get ops. A common
use case is
Github user ash211 commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69521957
@jegonzal https://issues.apache.org/jira/browse/SPARK-4672 is relevant for
specifically GraphX encountering the stack overflow and has extensive
discussion, but I don't
Github user jegonzal commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69531109
Hmm, we really need to elevate this to a full issue. I have run into the
stack overflow in MLlib (ALS) as well.
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69475120
@octavian-ganea IndexedRDD creates a new lineage entry for each operation.
This enables fault tolerance but, as with other iterative Spark programs,
causes stack
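A plain-Scala caricature of the lineage growth described above (not the IndexedRDD implementation; the names are made up): each update appends a node to a chain, naive recursive evaluation of a long chain overflows the stack, and periodically materializing the value, as checkpointing does, keeps the chain short.

```scala
// One lineage entry per operation: evaluating the chain recurses once per entry.
sealed trait Lineage { def eval: Int }
final case class Source(v: Int) extends Lineage { def eval: Int = v }
final case class Step(parent: Lineage) extends Lineage { def eval: Int = parent.eval + 1 }

// A million chained updates: recursive evaluation blows the stack.
var deep: Lineage = Source(0)
for (_ <- 1 to 1000000) deep = Step(deep)
val overflowed =
  try { deep.eval; false } catch { case _: StackOverflowError => true }
assert(overflowed)

// "Checkpointing": every 100 steps, materialize the value and restart the
// chain, so evaluation depth stays bounded.
var cp: Lineage = Source(0)
for (i <- 1 to 1000000) {
  cp = Step(cp)
  if (i % 100 == 0) cp = Source(cp.eval)
}
assert(cp.eval == 1000000)
```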
Github user octavian-ganea commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-69365816
Thanks for the nice work!
I am trying to use this IndexedRDD as a distributed hash map and I would
like to be able to insert and update many entries (tens of
Github user ankurdave closed the pull request at:
https://github.com/apache/spark/pull/1297
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-68017172
IndexedRDD is now part of Spark Packages, so I'm closing this PR and have
moved it to a separate repository: https://github.com/amplab/spark-indexedrdd.
The
Github user nchammas commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-68019141
@ankurdave Does this mean IndexedRDD will not become part of Spark Core, or
is that still potentially happening in the near future?
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-68019859
@nchammas I don't think that's going to happen in the near future since the
interface and implementation are relatively unstable, but it could still happen
eventually.
Github user adamnovak commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-6365
Can it be in Spark 1.3? This sort of functionality would really help us get
a Spark-based implementation of the stuff that
@ga4gh/global-alliance-committers is doing
Github user bobbych commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-62686554
Firstly, thanks for the work!
I have one question: does it support getPersistentRDDs? The use case is
reusing a cached RDD, something along the lines of the Spark Job Server.
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-62687768
@bobbych IndexedRDD handles persistence by caching its partitionsRDD, which
is the MapPartitionsRDD that you're getting back from sc.getPersistentRDDs. As
far as I
Github user pwais commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-62687253
Curious, will this ship in 1.2? (Also just want to ❤ for such a lovely
PR)
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-57905819
This looks really interesting. Is there a blocker for supporting generic
keys (or at least say `String`), or is that a performance issue?
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-57917807
@MLnick It's a slight performance issue, since we currently use
PrimitiveKeyOpenHashMap which optimizes for primitive keys by avoiding null
tracking, but I think the
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-57693369
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21218/consoleFull)
for PR 1297 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-57703581
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-57703566
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21218/consoleFull)
for PR 1297 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56926924
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20845/consoleFull)
for PR 1297 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56932588
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56932585
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20845/consoleFull)
for PR 1297 at commit
Github user markncooper commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56558807
Is it correct to assume that persist() is necessary otherwise the index
will get recreated each time it's used?
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56559155
@markncooper Yes, the IndexedRDD operations are implemented purely in terms
of Spark transformations, so they will get recomputed each time the result is
used unless
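A plain-Scala analogy for the recomputation point above (not Spark code; the names are illustrative): a `def` recomputes its body on every use, like an unpersisted lineage, while a `lazy val` computes once and reuses the result, like a persisted one.

```scala
// Count how many times the "index" is actually built.
var builds = 0
def buildIndex(): Map[Int, String] = {
  builds += 1
  (1 to 3).map(i => i -> ("v" + i)).toMap
}

def unpersisted: Map[Int, String] = buildIndex()    // recomputed per access
lazy val persisted: Map[Int, String] = buildIndex() // computed once, cached

unpersisted; unpersisted
assert(builds == 2)   // without persistence, the index was rebuilt each time

persisted; persisted
assert(builds == 3)   // one build on first access, then the cached value
```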
Github user ankurdave commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r17945448
--- Diff:
core/src/main/scala/org/apache/spark/rdd/IndexedRDDPartitionLike.scala ---
@@ -0,0 +1,426 @@
+/*
+ * Licensed to the Apache Software
Github user ankurdave commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r17945543
--- Diff:
core/src/main/scala/org/apache/spark/rdd/IndexedRDDPartitionLike.scala ---
@@ -0,0 +1,426 @@
+/*
+ * Licensed to the Apache Software
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56605365
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20734/consoleFull)
for PR 1297 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56610747
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20734/consoleFull)
for PR 1297 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56610754
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20734/
Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r17790219
--- Diff: core/src/main/scala/org/apache/spark/rdd/IndexedRDDLike.scala ---
@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF)
Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r17791303
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/ImmutableLongOpenHashSet.scala
---
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache
Github user squito commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56199798
This looks great! My comments are minor.
I know it's early to be discussing example docs, but I just wanted to
mention that I can see caching being an area of
Github user markncooper commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-56236278
For what it's worth (and we are early on in our Spark usage), we've
kicked the tires on this IndexedRDD and we love it. Thanks Ankur. We'll report
back with a
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-54382233
What's the status of this PR? Are we blocking on design review or
Spark/GraphX roadmap discussions?
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-54383337
We've had a design review; the summary was that this design is good, though
we will eventually want to support alternative update mechanisms such as
log-structured
Github user ankurdave commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r14908709
--- Diff: core/src/main/scala/org/apache/spark/rdd/IndexedRDDLike.scala ---
@@ -0,0 +1,338 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48988645
QA tests have started for PR 1297. This patch merges cleanly. View
progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16655/consoleFull
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r14799543
--- Diff: core/src/main/scala/org/apache/spark/rdd/IndexedRDD.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF)
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48417786
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48417780
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48418341
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48418352
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48418901
Merged build triggered.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48418910
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48419500
Merged build finished.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48420477
Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48420478
All automated tests passed.
Refer to this link for build results:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16435/
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48421014
All automated tests passed.
Refer to this link for build results:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16436/
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48421013
Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48144983
Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48144986
All automated tests passed.
Refer to this link for build results:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16361/
Github user concretevitamin commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r14578164
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/ImmutableVector.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache
Github user concretevitamin commented on a diff in the pull request:
https://github.com/apache/spark/pull/1297#discussion_r14578170
--- Diff:
core/src/main/scala/org/apache/spark/util/collection/ImmutableVector.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache
Github user ankurdave commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48138509
@concretevitamin Thanks for the comments. I also found a way to simplify
the design by unifying `IndexedRDD(Partition)Like` and
`IndexedRDD(Partition)Ops` as you
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48140661
Merged build started.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48006656
Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1297#issuecomment-48006657
All automated tests passed.
Refer to this link for build results:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16334/