[
https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1536:
-
Assignee: Manish Amde
Add multiclass classification support to MLlib
[
https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1546:
-
Assignee: Manish Amde
Add AdaBoost algorithm to Spark MLlib
[
https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1547:
-
Assignee: Manish Amde
Add gradient boosting algorithm to MLlib
[
https://issues.apache.org/jira/browse/SPARK-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1544:
-
Assignee: Manish Amde
Add support for creating deep decision trees
[
https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1545:
-
Assignee: Manish Amde
Add Random Forest algorithm to MLlib
[
https://issues.apache.org/jira/browse/SPARK-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1535:
-
Assignee: Tor Myklebust
jblas's DoubleMatrix(double[]) ctor creates garbage; avoid
[
https://issues.apache.org/jira/browse/SPARK-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1535.
--
Resolution: Fixed
Fix Version/s: 1.0.0
jblas's DoubleMatrix(double[]) ctor creates
Matei Zaharia created SPARK-1540:
Summary: Investigate whether we should require keys in PairRDD to
be Comparable
Key: SPARK-1540
URL: https://issues.apache.org/jira/browse/SPARK-1540
Project: Spark
The problem is that groupByKey means “bring all the points with this same key
to the same JVM”. Your input is a Seq[Point], so you have to have all the
points there. This means that a) all points will be sent across the network in
a cluster, which is slow (and Spark goes through this sending
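A minimal Scala sketch of the alternative implied above (the Point case class and the per-key centroid aggregation are hypothetical): reduceByKey combines values locally on each node before the shuffle, so only small partial sums cross the network instead of every point.
import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey
import org.apache.spark.rdd.RDD

case class Point(x: Double, y: Double)

// Expensive: groupByKey ships every Point for a key to a single JVM.
// val grouped = points.groupByKey()

// Cheaper when a per-key summary is enough: partial sums are merged on each node first.
def centroids(points: RDD[(String, Point)]): RDD[(String, Point)] =
  points
    .mapValues(p => (p.x, p.y, 1L))
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    .mapValues { case (sx, sy, n) => Point(sx / n, sy / n) }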
[
https://issues.apache.org/jira/browse/SPARK-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1462.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Examples of ML algorithms are using deprecated
Hi Bertrand,
We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that
will allow saving pickled objects. Unfortunately this is not in yet, but there
is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161.
In 1.0, one feature we do have now is the
Yup, one reason it’s 2 actually is to give people a similar experience to
working with large files, in case their code doesn’t deal well with the file
being partitioned.
Matei
On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote:
Take a look at the minSplits argument for
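A small sketch of overriding that default when the file should be split into more tasks (the path and split count are hypothetical; the second argument was named minSplits in the 0.9-era API and minPartitions in later releases):
// Ask for at least 64 partitions instead of the default of 2, so a single
// large file is processed by many tasks in parallel.
val lines = sc.textFile("hdfs:///data/big.log", 64)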
Yes, both things can happen. Take a look at
http://spark.apache.org/docs/latest/job-scheduling.html, which includes
scheduling concurrent jobs within the same driver.
Matei
On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote:
What is the support for multi-tenancy in
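The job-scheduling page referenced above covers running concurrent jobs inside a single driver; a minimal sketch of turning on the fair scheduler for that case (the app name and paths are hypothetical):
import org.apache.spark.{SparkConf, SparkContext}

// With spark.scheduler.mode=FAIR, jobs submitted from different threads of the
// same driver share the cluster instead of queueing strictly FIFO.
val conf = new SparkConf()
  .setAppName("shared-driver")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Each action called from its own thread becomes a separate, fairly scheduled job.
new Thread(new Runnable {
  def run(): Unit = { sc.textFile("hdfs:///logs/a").count() }
}).start()
new Thread(new Runnable {
  def run(): Unit = { sc.textFile("hdfs:///logs/b").count() }
}).start()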
You can use mapPartitionsWithIndex and look at the partition index (0 will be
the first partition) to decide whether to skip the first line.
Matei
On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote:
We have similar needs but IIRC, I came to the conclusion that this would only
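A sketch of the header-skipping trick described above (the input path is hypothetical):
// Drop the first line of partition 0, i.e. the file's header row, and keep
// every other line unchanged.
val lines = sc.textFile("hdfs:///data/input.csv")
val withoutHeader = lines.mapPartitionsWithIndex { (index, iter) =>
  if (index == 0) iter.drop(1) else iter
}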
Spark can actually launch multiple executors on the same node if you configure
it that way, but if you haven’t done that, this might mean that some tasks are
reading data from the cache, and some from HDFS. (In the HDFS case Spark will
only report it as NODE_LOCAL since HDFS isn’t tied to a
Kryo won’t make a major impact on PySpark because it just stores data as byte[]
objects, which are fast to serialize even with Java. But it may be worth a try
— you would just set spark.serializer and not try to register any classes. What
might make more impact is storing data as
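A minimal sketch of trying Kryo without registering any classes, as suggested above (the app name is hypothetical):
import org.apache.spark.{SparkConf, SparkContext}

// Switch the serializer to Kryo; no class registration is attempted.
val conf = new SparkConf()
  .setAppName("kryo-test")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)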
Matei Zaharia created SPARK-1484:
Summary: MLlib should warn if you are using an iterative algorithm
on non-cached data
Key: SPARK-1484
URL: https://issues.apache.org/jira/browse/SPARK-1484
Project
Matei Zaharia created SPARK-1481:
Summary: Add Naive Bayes to MLlib documentation
Key: SPARK-1481
URL: https://issues.apache.org/jira/browse/SPARK-1481
Project: Spark
Issue Type: Sub-task
[
https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1241.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Support sliding in RDD
[
https://issues.apache.org/jira/browse/SPARK-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966987#comment-13966987
]
Matei Zaharia commented on SPARK-1225:
--
Included in https://github.com/apache/spark
[
https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967196#comment-13967196
]
Matei Zaharia commented on SPARK-1355:
--
I have to say this was pretty good, I avoided
, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Matei,
Where is the functionality in 0.9 to spill data within a task (separately
from persist)? My apologies if this is something obvious but I don't see it
in the api docs.
-Suren
On Thu, Apr 10, 2014 at 3:59 PM, Matei Zaharia
[
https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1428.
--
Resolution: Fixed
Fix Version/s: 1.0.0
MLlib should convert non-float64 NumPy arrays
Matei Zaharia created SPARK-1467:
Summary: Make StorageLevel.apply() factory methods experimental
Key: SPARK-1467
URL: https://issues.apache.org/jira/browse/SPARK-1467
Project: Spark
Issue
I haven’t seen this but it may be a bug in Typesafe Config, since this is
serializing a Config object. We don’t actually use Typesafe Config ourselves.
Do you have any nulls in the data itself by any chance? And do you know how
that Config object is getting there?
Matei
On Apr 9, 2014, at
To add onto the discussion about memory working space, 0.9 introduced the
ability to spill data within a task to disk, and in 1.0 we’re also changing the
interface to allow spilling data within the same *group* to disk (e.g. when you
do groupBy and get a key with lots of values). The main
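For reference, a sketch of the configuration knobs tied to the spilling behavior described above (property names as of the 0.9/1.0 era; the values shown are only illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")           // let per-task aggregation buffers spill to disk
  .set("spark.shuffle.memoryFraction", "0.3")   // memory used for aggregation before spilling kicks in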
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package
you added on the PYTHONPATH? How did you set the path, was it in
conf/spark-env.sh?
Matei
On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote:
I am getting a python ImportError on Spark standalone
, Chen Chao,
Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai,
Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout,
Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham,
Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang,
Raymond Liu, Reynold Xin, Sandy
)
at
org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85)
On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Cool, thanks for the update. Have you tried running a branch with this fix
(e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak
issue are you
[
https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia reassigned SPARK-1428:
Assignee: Matei Zaharia
MLlib should convert non-float64 NumPy arrays to float64 instead
[
https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1428:
-
Assignee: Sandeep Singh (was: Matei Zaharia)
MLlib should convert non-float64 NumPy arrays
[
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962026#comment-13962026
]
Matei Zaharia commented on SPARK-1021:
--
Note that if we do this, we'll need a similar
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them
here:
https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
Matei
On Apr 7, 2014, at 9:45 PM,
The off-heap storage level is currently tied to Tachyon, but it might support
other forms of off-heap storage later. However it’s not really designed to be
mixed with the other ones. For this use case you may want to rely on memory
locality and have some custom code to push the data to the
Matei Zaharia created SPARK-1423:
Summary: Add scripts for launching Spark on Windows Azure
Key: SPARK-1423
URL: https://issues.apache.org/jira/browse/SPARK-1423
Project: Spark
Issue Type
Matei Zaharia created SPARK-1422:
Summary: Add scripts for launching Spark on Google Compute Engine
Key: SPARK-1422
URL: https://issues.apache.org/jira/browse/SPARK-1422
Project: Spark
Issue
[
https://issues.apache.org/jira/browse/SPARK-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961283#comment-13961283
]
Matei Zaharia commented on SPARK-1424:
--
More generally we should have flags
[
https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1421:
-
Description: Currently it requires Python 2.7 because it uses some new
APIs, but they should
[
https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1421:
-
Summary: Make MLlib work on Python 2.6 (was: Make MLlib work on Python 2.6
and NumPy 1.7
Matei Zaharia created SPARK-1426:
Summary: Make MLlib work with NumPy versions older than 1.7
Key: SPARK-1426
URL: https://issues.apache.org/jira/browse/SPARK-1426
Project: Spark
Issue Type
[
https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia reassigned SPARK-1421:
Assignee: Matei Zaharia
Make MLlib work on Python 2.6
[
https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1421.
--
Resolution: Fixed
Fix Version/s: 0.9.2
1.0.0
Make MLlib work
[
https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1133.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Add a new small files input for MLlib, which
[
https://issues.apache.org/jira/browse/SPARK-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia reassigned SPARK-1414:
Assignee: Matei Zaharia
Python API for SparkContext.wholeTextFiles
Matei Zaharia created SPARK-1416:
Summary: Add support for SequenceFiles in PySpark
Key: SPARK-1416
URL: https://issues.apache.org/jira/browse/SPARK-1416
Project: Spark
Issue Type
[
https://issues.apache.org/jira/browse/SPARK-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1198.
--
Resolution: Fixed
Fix Version/s: 1.0.0
Allow pipes tasks to run in different sub
Exceptions should be sent back to the driver program and logged there (with a
SparkException thrown if a task fails more than 4 times), but there were some
bugs before where this did not happen for non-Serializable exceptions. We
changed it to pass back the stack traces only (as text), which
, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
BTW one other thing — in your experience, Diana, which non-text
InputFormats would be most useful to support in Python first? Would it be
Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or
something
This can’t be done through the script right now, but you can do it manually as
long as the cluster is stopped. If the cluster is stopped, just go into the AWS
Console, right click a slave and choose “launch more of these” to add more. Or
select multiple slaves and delete them. When you run
As long as the filesystem is mounted at the same path on every node, you should
be able to just run Spark and use a file:// URL for your files.
The only downside with running it this way is that Lustre won’t expose data
locality info to Spark, the way HDFS does. That may not matter if it’s a
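A one-line sketch of the file:// usage described above (the mount point is hypothetical):
// Works as long as /mnt/lustre is mounted at the same path on every node.
val data = sc.textFile("file:///mnt/lustre/datasets/events.txt")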
[
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1162:
-
Fix Version/s: 0.9.2
1.0.0
Add top() and takeOrdered() to PySpark
[
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1162.
--
Resolution: Fixed
Add top() and takeOrdered() to PySpark
[
https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1162:
-
Assignee: Prashant Sharma (was: prashant)
Add top() and takeOrdered() to PySpark
[
https://issues.apache.org/jira/browse/SPARK-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1333.
--
Resolution: Fixed
Java API for running SQL queries
[
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1134:
-
Affects Version/s: 0.9.1
ipython won't run standalone python script
[
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1134.
--
Resolution: Fixed
ipython won't run standalone python script
[
https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1134:
-
Fix Version/s: 0.9.2
1.0.0
ipython won't run standalone python script
[
https://issues.apache.org/jira/browse/SPARK-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1296:
-
Fix Version/s: (was: 1.0.0)
Make RDDs Covariant
---
Key
Matei Zaharia created SPARK-1413:
Summary: Parquet messes up stdout and stdin when used in Spark REPL
Key: SPARK-1413
URL: https://issues.apache.org/jira/browse/SPARK-1413
Project: Spark
Hey Bhaskar, this is still the plan, though QAing might take longer than 15
days. Right now since we’ve passed April 1st, the only features considered for
a merge are those that had pull requests in review before. (Some big ones are
things like annotating the public APIs and simplifying
, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Steve,
This configuration sounds pretty good. The one thing I would consider is
having more disks, for two reasons — Spark uses the disks for large shuffles
and out-of-core operations, and often it’s better to run HDFS or your
[
https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1391:
-
Assignee: Min Zhou
BlockManager cannot transfer blocks larger than 2G in size
[
https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia updated SPARK-1133:
-
Assignee: Xusen Yin
Add a new small files input for MLlib, which will return an RDD[(fileName
You could probably port it back, but it required some changes on the Java side
as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues
with 0.9.
Matei
On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote:
Hi there,
For some reason the
Hi Manoj,
At the current time, for drop-in replacement of Hive, it will be best to stick
with Shark. Over time, Shark will use the Spark SQL backend, but should remain
deployable the way it is today (including launching the SharkServer, using the
Hive CLI, etc). Spark SQL is better for
worth the
annoyance of everyone needing to download a new version of Scala, making
yet another version of the AMIs, etc.
-Kay
On Thu, Mar 27, 2014 at 4:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Sounds good. Feel free to send a PR even though it's a small change (it
leads
Weird, how exactly are you pulling out the sample? Do you have a small program
that reproduces this?
Matei
On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
Sounds good. Feel free to send a PR even though it’s a small change (it leads
to better Git history and such).
Matei
On Mar 27, 2014, at 4:15 PM, Mark Hamstra m...@clearstorydata.com wrote:
FYI, Spark master does build cleanly and the tests do run successfully with
Scala version set to
exceptions, but I think they all stem from the above,
eg. org.apache.spark.SparkException: Error sending message to
BlockManagerMaster
Let me know if there are other settings I should try, or if I should
try a newer snapshot.
Thanks again!
On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia
Congrats Michael & co for putting this together — this is probably the neatest
piece of technology added to Spark in the past few months, and it will greatly
change what users can do as more data sources are added.
Matei
On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com
wrote:
Much thanks, I suspected this would be difficult. I was hoping to
generate some 4 degrees of separation-like statistics. Looks like
I'll just have to work with a subset of my graph.
On Wed, Mar 26, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
All-pairs distances
+1 looks good to me. I tried both the source and CDH4 versions and looked at
the new streaming docs.
The release notes seem slightly incomplete, but I guess you’re still working on
them? Anyway those don’t go into the release tarball so it’s okay.
Matei
On Mar 24, 2014, at 2:01 PM, Tathagata
it on a Linux cluster.
I opened https://spark-project.atlassian.net/browse/SPARK-1326 to track it. We
can put it in another RC if we find bigger issues.
Matei
On Mar 25, 2014, at 10:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
+1 looks good to me. I tried both the source and CDH4 versions
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? It tries to serialize that many objects together at a time, which
might be too much. By default the batchSize is 1024.
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi
Congrats Michael and all for getting this so far. Spark SQL and Catalyst will
make it much easier to use structured data in Spark, and open the door for some
very cool extensions later.
Matei
On Mar 20, 2014, at 11:15 PM, Heiko Braun ike.br...@googlemail.com wrote:
Congrats! That's a really
Try passing the shuffle=true parameter to coalesce, then it will do the map in
parallel but still pass all the data through one reduce node for writing it
out. That’s probably the fastest it will get. No need to cache if you do that.
Matei
On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
, at 5:01 PM, Aureliano Buendia buendia...@gmail.com wrote:
Good to know it's as simple as that! I wonder why shuffle=true is not the
default for coalesce().
On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try passing the shuffle=true parameter to coalesce
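A sketch of that coalesce pattern (the RDD, the transformation, and the output path are hypothetical):
// shuffle = true keeps the upstream map work parallel and only funnels the
// data through a single task for the final write.
val result = bigRdd
  .map(expensiveTransform)
  .coalesce(1, shuffle = true)
result.saveAsTextFile("hdfs:///out/single-file")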
-Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*. I'm
not sure how this varies from 0.9.0 release, but it seems to work on
SNAPSHOT.
On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Try checking spark-env.sh on the workers as well. Maybe code there is
somehow
Hi Adrian,
On every timestep of execution, we receive new data, then report updated word
counts for that new data plus the past 30 seconds. The latency here is about
how quickly you get these updated counts once the new batch of data comes in.
It’s true that the count reflects some data from
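A minimal Scala sketch of the windowed count being described (host, port, and batch interval are hypothetical): each 5-second batch reports counts over that batch plus the preceding 30 seconds.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(5))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words
  .map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))
counts.print()
ssc.start()
ssc.awaitTermination()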
at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hey Nick, I’m curious, have you been doing any further development on this?
It would be good to get expanded InputFormat support in Spark 1.0. To start
with we don’t have to do SequenceFiles in particular, we can do stuff like
Avro
Try checking spark-env.sh on the workers as well. Maybe code there is somehow
overriding the spark.executor.memory setting.
Matei
On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote:
Hello, I'm using the Github snapshot of PySpark and having trouble setting
the worker memory
Yes, Spark automatically removes old RDDs from the cache when you make new
ones. Unpersist forces it to remove them right away. In both cases though, note
that Java doesn’t garbage-collect the objects released until later.
Matei
On Mar 19, 2014, at 7:22 PM, Nicholas Chammas
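A small sketch of the difference (the path is hypothetical):
// Cached RDDs are evicted automatically as the cache fills up; unpersist()
// drops the cached blocks immediately. Either way the JVM reclaims the
// underlying objects only at a later garbage collection.
val cached = sc.textFile("hdfs:///data/part1").cache()
cached.count()        // materializes the cache
cached.unpersist()    // release the cached blocks right away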
Hey Nick, I’m curious, have you been doing any further development on this? It
would be good to get expanded InputFormat support in Spark 1.0. To start with
we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if
it’s easy to read in Python) or some kind of
I just meant that you call union() before creating the RDDs that you pass to
new Graph(). If you call it after it will produce other RDDs.
The Graph() constructor actually shuffles and “indexes” the data to make graph
operations efficient, so it’s not too easy to add elements after. You could
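A sketch of the ordering being suggested (the edge data and attribute values are hypothetical):
import org.apache.spark.graphx.{Edge, Graph}

// Union the edge RDDs *before* building the graph, because the Graph
// constructor shuffles and indexes its inputs when it is created.
val edges1 = sc.parallelize(Seq(Edge(1L, 2L, 1.0)))
val edges2 = sc.parallelize(Seq(Edge(2L, 3L, 1.0)))
val graph  = Graph.fromEdges(edges1.union(edges2), defaultValue = 0.0)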
Hi Diana,
Non-text input formats are only supported in Java and Scala right now, where
you can use sparkContext.hadoopFile or .hadoopDataset to load data with any
InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only
have textFile, which gives you one record per line.
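A short sketch of the Scala route mentioned above, using hadoopFile with an arbitrary InputFormat (here the plain TextInputFormat; the path is hypothetical):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val records = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
val lines   = records.map { case (_, text) => text.toString }   // (key, value) pairs from the InputFormat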
to me how to do that as I
probably should be.
Thanks,
Diana
On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Diana,
Non-text input formats are only supported in Java and Scala right now, where
you can use sparkContext.hadoopFile or .hadoopDataset
Yup, it only returns each value once.
Matei
On Mar 17, 2014, at 1:14 PM, Adrian Mocanu amoc...@verticalscope.com wrote:
Hi
Quick question here,
I know that .foreach is not idempotent. I am wondering if collect() is
idempotent? Meaning that once I’ve collect()-ed if spark node crashes I
)
On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Here’s an example of getting together all lines in a file as one string:
$ cat dir/a.txt
Hello
world!
$ cat dir/b.txt
What's
up??
$ bin/pyspark
files = sc.textFile("dir")
files.collect()
[u'Hello
Thanks for reporting this, looking into it.
On Mar 17, 2014, at 2:44 PM, Walrus theCat walrusthe...@gmail.com wrote:
ping
On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson ilike...@gmail.com wrote:
Looks like everything from 0.8.0 and before errors similarly (though Spark
0.3 for Scala
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote:
I also found that the product and user RDDs were being rebuilt many times
over in my tests, even for tiny data sets. By persisting the RDD returned
from updateFeatures() I was able to avoid a raft of duplicate computations.
Is
If it’s a driver on the cluster, please open a JIRA issue about this — this
kill command is indeed intended to work.
Matei
On Mar 16, 2014, at 2:35 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Are you embedding your driver inside the cluster?
If not then that command will not kill the
Thanks, I’ve added you:
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Let me know
if you want to change any wording.
Matei
On Mar 16, 2014, at 6:48 AM, Egor Pahomov pahomov.e...@gmail.com wrote:
Hi, page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
I like the pom-reader approach as well — in particular, that it lets you add
extra stuff in your SBT build after loading the dependencies from the POM.
Profiles would be the one thing missing to be able to pass options through.
Matei
On Mar 14, 2014, at 10:03 AM, Patrick Wendell
Since it’s from Scala, it might mean you’re running with a different version of
Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while
Spark 0.9 uses Scala 2.10.
Matei
On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia)
arockia.r.jeya...@verizon.com wrote:
Hi,
Thanks, added you.
On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote:
Dear Spark team,
thanks for the great work and congrats on becoming an Apache top-level
project!
You could add us to your Powered-by-page, because we are using Spark (and
Shark) to perform
I agree that we can’t keep adding these to the core API, partly because it will
get unwieldy to maintain and partly just because each storage system will bring
in lots of dependencies. We can simply have helper classes in different modules
for each storage system. There’s some discussion on
Hi Dana,
It’s hard to tell exactly what is consuming time, but I’d suggest starting by
profiling the single application first. Three things to look at there:
1) How many stages and how many tasks per stage is Spark launching (in the
application web UI at http://driver:4040)? If you have
Xt*X should mathematically always be positive semi-definite, so the only way
this might be bad is if it’s not invertible due to linearly dependent rows.
This might happen due to the initialization or possibly due to numerical
issues, though it seems unlikely. Maybe it also happens if some users
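The positive semi-definiteness claim follows in one line: for any vector $v$, $v^\top (X^\top X)\,v = (Xv)^\top (Xv) = \lVert Xv \rVert^2 \ge 0$, so $X^\top X$ can only fail to be invertible when $Xv = 0$ for some $v \neq 0$, i.e. when $X$ is rank-deficient (linearly dependent rows or columns).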
definite.
--
Sean Owen | Director, Data Science | London
On Thu, Mar 6, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Xt*X should mathematically always be positive semi-definite, so the only way
this might be bad is if it’s not invertible due to linearly dependent rows
,
If the data has linearly dependent rows, ALS should have a fallback
mechanism: either remove the rows and then call BLAS posv, or call BLAS gesv,
or use a Breeze QR decomposition.
I can share the analysis over email.
Thanks.
Deb
On Thu, Mar 6, 2014 at 9:39 AM, Matei Zaharia matei.zaha