[jira] [Updated] (SPARK-1536) Add multiclass classification support to MLlib

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1536: - Assignee: Manish Amde Add multiclass classification support to MLlib

[jira] [Updated] (SPARK-1546) Add AdaBoost algorithm to Spark MLlib

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1546: - Assignee: Manish Amde Add AdaBoost algorithm to Spark MLlib

[jira] [Updated] (SPARK-1547) Add gradient boosting algorithm to MLlib

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1547: - Assignee: Manish Amde Add gradient boosting algorithm to MLlib

[jira] [Updated] (SPARK-1544) Add support for creating deep decision trees.

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1544: - Assignee: Manish Amde Add support for creating deep decision trees

[jira] [Updated] (SPARK-1545) Add Random Forest algorithm to MLlib

2014-04-20 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1545: - Assignee: Manish Amde Add Random Forest algorithm to MLlib

[jira] [Updated] (SPARK-1535) jblas's DoubleMatrix(double[]) ctor creates garbage; avoid

2014-04-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1535: - Assignee: Tor Myklebust jblas's DoubleMatrix(double[]) ctor creates garbage; avoid

[jira] [Resolved] (SPARK-1535) jblas's DoubleMatrix(double[]) ctor creates garbage; avoid

2014-04-19 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1535. -- Resolution: Fixed Fix Version/s: 1.0.0 jblas's DoubleMatrix(double[]) ctor creates

[jira] [Created] (SPARK-1540) Investigate whether we should require keys in PairRDD to be Comparable

2014-04-19 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1540: Summary: Investigate whether we should require keys in PairRDD to be Comparable Key: SPARK-1540 URL: https://issues.apache.org/jira/browse/SPARK-1540 Project: Spark

Re: extremely slow k-means version

2014-04-19 Thread Matei Zaharia
The problem is that groupByKey means “bring all the points with this same key to the same JVM”. Your input is a Seq[Point], so you have to have all the points there. This means that a) all points will be sent across the network in a cluster, which is slow (and Spark goes through this sending
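The fix the thread points toward is to replace groupByKey with a per-key aggregation (reduceByKey/aggregateByKey), so only small partial sums cross the network instead of every point. A minimal sketch of that aggregation pattern in plain Python, with dicts standing in for Spark's shuffle (the data and names are illustrative, not from the thread):

```python
from collections import defaultdict

# Each point is tagged with its assigned cluster id. Instead of collecting
# all points per cluster in one JVM (groupByKey), combine (sum, count)
# partials per cluster locally and merge them -- the reduceByKey pattern.
points = [(0, (1.0, 2.0)), (0, (3.0, 4.0)), (1, (5.0, 6.0))]

partials = defaultdict(lambda: ([0.0, 0.0], 0))
for cluster, (x, y) in points:
    s, n = partials[cluster]
    partials[cluster] = ([s[0] + x, s[1] + y], n + 1)

# New centroid = mean of each cluster's points. Only tiny (sum, count)
# records ever need to be shuffled, never the points themselves.
centroids = {c: (s[0] / n, s[1] / n) for c, (s, n) in partials.items()}
# centroids == {0: (2.0, 3.0), 1: (5.0, 6.0)}
```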

[jira] [Resolved] (SPARK-1462) Examples of ML algorithms are using deprecated APIs

2014-04-16 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1462. -- Resolution: Fixed Fix Version/s: 1.0.0 Examples of ML algorithms are using deprecated

Re: PySpark still reading only text?

2014-04-16 Thread Matei Zaharia
Hi Bertrand, We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161. In 1.0, one feature we do have now is the

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned. Matei On Apr 15, 2014, at 9:53 AM, Aaron Davidson ilike...@gmail.com wrote: Take a look at the minSplits argument for

Re: Multi-tenant?

2014-04-15 Thread Matei Zaharia
Yes, both things can happen. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html, which includes scheduling concurrent jobs within the same driver. Matei On Apr 15, 2014, at 4:08 PM, Ian Ferreira ianferre...@hotmail.com wrote: What is the support for multi-tenancy in

Re: RDD.tail()

2014-04-14 Thread Matei Zaharia
You can use mapPartitionsWithIndex and look at the partition index (0 will be the first partition) to decide whether to skip the first line. Matei On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote: We have similar needs but IIRC, I came to the conclusion that this would only
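The skip-the-first-line trick described above can be sketched in plain Python, with lists standing in for an RDD's partitions (the partition layout here is hypothetical):

```python
# Partition 0 holds the file's first line. mapPartitionsWithIndex hands each
# partition its index, so the function can drop the first element only when
# index == 0 and pass every other partition through untouched.
partitions = [["header", "a", "b"], ["c", "d"]]

def drop_header(index, iterator):
    it = iter(iterator)
    if index == 0:
        next(it, None)  # skip the first line of the first partition
    return it

result = [x for i, p in enumerate(partitions) for x in drop_header(i, p)]
# result == ["a", "b", "c", "d"]
```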

Re: process_local vs node_local

2014-04-14 Thread Matei Zaharia
Spark can actually launch multiple executors on the same node if you configure it that way, but if you haven’t done that, this might mean that some tasks are reading data from the cache, and some from HDFS. (In the HDFS case Spark will only report it as NODE_LOCAL since HDFS isn’t tied to a

Re: using Kryo with pyspark?

2014-04-14 Thread Matei Zaharia
Kryo won’t make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set spark.serializer and not try to register any classes. What might make more impact is storing data as

[jira] [Created] (SPARK-1484) MLlib should warn if you are using an iterative algorithm on non-cached data

2014-04-13 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1484: Summary: MLlib should warn if you are using an iterative algorithm on non-cached data Key: SPARK-1484 URL: https://issues.apache.org/jira/browse/SPARK-1484 Project

[jira] [Created] (SPARK-1481) Add Naive Bayes to MLlib documentation

2014-04-12 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1481: Summary: Add Naive Bayes to MLlib documentation Key: SPARK-1481 URL: https://issues.apache.org/jira/browse/SPARK-1481 Project: Spark Issue Type: Sub-task

[jira] [Resolved] (SPARK-1241) Support sliding in RDD

2014-04-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1241. -- Resolution: Fixed Fix Version/s: 1.0.0 Support sliding in RDD

[jira] [Commented] (SPARK-1225) ROC AUC and Average Precision for Binary classification models

2014-04-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13966987#comment-13966987 ] Matei Zaharia commented on SPARK-1225: -- Included in https://github.com/apache/spark


[jira] [Commented] (SPARK-1355) Switch website to the Apache CMS

2014-04-11 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967196#comment-13967196 ] Matei Zaharia commented on SPARK-1355: -- I have to say this was pretty good, I avoided

Re: Spark - ready for prime time?

2014-04-11 Thread Matei Zaharia
, Surendranauth Hiraman suren.hira...@velos.io wrote: Matei, Where is the functionality in 0.9 to spill data within a task (separately from persist)? My apologies if this is something obvious but I don't see it in the api docs. -Suren On Thu, Apr 10, 2014 at 3:59 PM, Matei Zaharia

[jira] [Resolved] (SPARK-1428) MLlib should convert non-float64 NumPy arrays to float64 instead of complaining

2014-04-10 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1428. -- Resolution: Fixed Fix Version/s: 1.0.0 MLlib should convert non-float64 NumPy arrays

[jira] [Created] (SPARK-1467) Make StorageLevel.apply() factory methods experimental

2014-04-10 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1467: Summary: Make StorageLevel.apply() factory methods experimental Key: SPARK-1467 URL: https://issues.apache.org/jira/browse/SPARK-1467 Project: Spark Issue

Re: NPE using saveAsTextFile

2014-04-10 Thread Matei Zaharia
I haven’t seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don’t actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at

Re: Spark - ready for prime time?

2014-04-10 Thread Matei Zaharia
To add onto the discussion about memory working space, 0.9 introduced the ability to spill data within a task to disk, and in 1.0 we’re also changing the interface to allow spilling data within the same *group* to disk (e.g. when you do groupBy and get a key with lots of values). The main

Re: Spark 0.9.1 PySpark ImportError

2014-04-10 Thread Matei Zaharia
Kind of strange because we haven’t updated CloudPickle AFAIK. Is this a package you added on the PYTHONPATH? How did you set the path, was it in conf/spark-env.sh? Matei On Apr 10, 2014, at 7:39 AM, aazout albert.az...@velos.io wrote: I am getting a python ImportError on Spark standalone

Re: Spark 0.9.1 released

2014-04-09 Thread Matei Zaharia
, Chen Chao, Christian Lundgren, Diana Carroll, Emtiaz Ahmed, Frank Dai, Henry Saputra, jianghan, Josh Rosen, Jyotiska NK, Kay Ousterhout, Kousuke Saruta, Mark Grover, Matei Zaharia, Nan Zhu, Nick Lanham, Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang, Raymond Liu, Reynold Xin, Sandy

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:85) On Thu, Apr 3, 2014 at 8:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Cool, thanks for the update. Have you tried running a branch with this fix (e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak issue are you

[jira] [Assigned] (SPARK-1428) MLlib should convert non-float64 NumPy arrays to float64 instead of complaining

2014-04-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1428: Assignee: Matei Zaharia MLlib should convert non-float64 NumPy arrays to float64 instead

[jira] [Updated] (SPARK-1428) MLlib should convert non-float64 NumPy arrays to float64 instead of complaining

2014-04-08 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1428: - Assignee: Sandeep Singh (was: Matei Zaharia) MLlib should convert non-float64 NumPy arrays

[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-04-07 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962026#comment-13962026 ] Matei Zaharia commented on SPARK-1021: -- Note that if we do this, we'll need a similar

Re: Contributing to Spark

2014-04-07 Thread Matei Zaharia
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them here: https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) Matei On Apr 7, 2014, at 9:45 PM,

Re: ephemeral storage level in spark ?

2014-04-06 Thread Matei Zaharia
The off-heap storage level is currently tied to Tachyon, but it might support other forms of off-heap storage later. However it’s not really designed to be mixed with the other ones. For this use case you may want to rely on memory locality and have some custom code to push the data to the

[jira] [Created] (SPARK-1423) Add scripts for launching Spark on Windows Azure

2014-04-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1423: Summary: Add scripts for launching Spark on Windows Azure Key: SPARK-1423 URL: https://issues.apache.org/jira/browse/SPARK-1423 Project: Spark Issue Type

[jira] [Created] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine

2014-04-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1422: Summary: Add scripts for launching Spark on Google Compute Engine Key: SPARK-1422 URL: https://issues.apache.org/jira/browse/SPARK-1422 Project: Spark Issue

[jira] [Commented] (SPARK-1424) InsertInto should work on JavaSchemaRDD as well.

2014-04-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961283#comment-13961283 ] Matei Zaharia commented on SPARK-1424: -- More generally we should have flags

[jira] [Updated] (SPARK-1421) Make MLlib work on Python 2.6

2014-04-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1421: - Description: Currently it requires Python 2.7 because it uses some new APIs, but they should

[jira] [Updated] (SPARK-1421) Make MLlib work on Python 2.6

2014-04-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1421: - Summary: Make MLlib work on Python 2.6 (was: Make MLlib work on Python 2.6 and NumPy 1.7

[jira] [Created] (SPARK-1426) Make MLlib work with NumPy versions older than 1.7

2014-04-05 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1426: Summary: Make MLlib work with NumPy versions older than 1.7 Key: SPARK-1426 URL: https://issues.apache.org/jira/browse/SPARK-1426 Project: Spark Issue Type

[jira] [Assigned] (SPARK-1421) Make MLlib work on Python 2.6

2014-04-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1421: Assignee: Matei Zaharia Make MLlib work on Python 2.6

[jira] [Resolved] (SPARK-1421) Make MLlib work on Python 2.6

2014-04-05 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1421. -- Resolution: Fixed Fix Version/s: 0.9.2 1.0.0 Make MLlib work

[jira] [Resolved] (SPARK-1133) Add a new small files input for MLlib, which will return an RDD[(fileName, content)]

2014-04-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1133. -- Resolution: Fixed Fix Version/s: 1.0.0 Add a new small files input for MLlib, which

[jira] [Assigned] (SPARK-1414) Python API for SparkContext.wholeTextFiles

2014-04-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia reassigned SPARK-1414: Assignee: Matei Zaharia Python API for SparkContext.wholeTextFiles

[jira] [Created] (SPARK-1416) Add support for SequenceFiles in PySpark

2014-04-04 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1416: Summary: Add support for SequenceFiles in PySpark Key: SPARK-1416 URL: https://issues.apache.org/jira/browse/SPARK-1416 Project: Spark Issue Type

[jira] [Resolved] (SPARK-1198) Allow pipes tasks to run in different sub-directories

2014-04-04 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1198. -- Resolution: Fixed Fix Version/s: 1.0.0 Allow pipes tasks to run in different sub

Re: How are exceptions in map functions handled in Spark?

2014-04-04 Thread Matei Zaharia
Exceptions should be sent back to the driver program and logged there (with a SparkException thrown if a task fails more than 4 times), but there were some bugs before where this did not happen for non-Serializable exceptions. We changed it to pass back the stack traces only (as text), which
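The retry behavior described above can be sketched as a driver-side loop. The limit of 4 attempts matches the thread; everything else (function names, the sample task) is illustrative:

```python
class SparkException(Exception):
    pass

def run_with_retries(task, max_failures=4):
    """Re-run a failing task, giving up with a SparkException after
    max_failures attempts, roughly as the driver does for a Spark task."""
    last_error = None
    for attempt in range(max_failures):
        try:
            return task()
        except Exception as e:
            last_error = e  # the driver logs the failure and reschedules
    raise SparkException(f"Task failed {max_failures} times: {last_error}")

calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return 42
```

Running `run_with_retries(flaky)` fails twice, succeeds on the third attempt, and returns 42 without the driver ever raising.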

Re: example of non-line oriented input data?

2014-04-04 Thread Matei Zaharia
, Mar 18, 2014 at 8:14 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW one other thing — in your experience, Diana, which non-text InputFormats would be most useful to support in Python first? Would it be Parquet or Avro, simple SequenceFiles with the Hadoop Writable types, or something

Re: Having spark-ec2 join new slaves to existing cluster

2014-04-04 Thread Matei Zaharia
This can’t be done through the script right now, but you can do it manually as long as the cluster is stopped. If the cluster is stopped, just go into the AWS Console, right click a slave and choose “launch more of these” to add more. Or select multiple slaves and delete them. When you run

Re: Spark on other parallel filesystems

2014-04-04 Thread Matei Zaharia
As long as the filesystem is mounted at the same path on every node, you should be able to just run Spark and use a file:// URL for your files. The only downside with running it this way is that Lustre won’t expose data locality info to Spark, the way HDFS does. That may not matter if it’s a

[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1162: - Fix Version/s: 0.9.2 1.0.0 Add top() and takeOrdered() to PySpark

[jira] [Resolved] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1162. -- Resolution: Fixed Add top() and takeOrdered() to PySpark

[jira] [Updated] (SPARK-1162) Add top() and takeOrdered() to PySpark

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1162: - Assignee: Prashant Sharma (was: prashant) Add top() and takeOrdered() to PySpark

[jira] [Resolved] (SPARK-1333) Java API for running SQL queries

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1333. -- Resolution: Fixed Java API for running SQL queries

[jira] [Updated] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1134: - Affects Version/s: 0.9.1 ipython won't run standalone python script

[jira] [Resolved] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-1134. -- Resolution: Fixed ipython won't run standalone python script

[jira] [Updated] (SPARK-1134) ipython won't run standalone python script

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1134: - Fix Version/s: 0.9.2 1.0.0 ipython won't run standalone python script

[jira] [Updated] (SPARK-1296) Make RDDs Covariant

2014-04-03 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1296: - Fix Version/s: (was: 1.0.0) Make RDDs Covariant --- Key

[jira] [Created] (SPARK-1413) Parquet messes up stdout and stdin when used in Spark REPL

2014-04-03 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-1413: Summary: Parquet messes up stdout and stdin when used in Spark REPL Key: SPARK-1413 URL: https://issues.apache.org/jira/browse/SPARK-1413 Project: Spark

Re: Spark 1.0.0 release plan

2014-04-03 Thread Matei Zaharia
Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying

Re: Optimal Server Design for Spark

2014-04-03 Thread Matei Zaharia
, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Steve, This configuration sounds pretty good. The one thing I would consider is having more disks, for two reasons — Spark uses the disks for large shuffles and out-of-core operations, and often it’s better to run HDFS or your

[jira] [Updated] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size

2014-04-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1391: - Assignee: Min Zhou BlockManager cannot transfer blocks larger than 2G in size

[jira] [Updated] (SPARK-1133) Add a new small files input for MLlib, which will return an RDD[(fileName, content)]

2014-04-02 Thread Matei Zaharia (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-1133: - Assignee: Xusen Yin Add a new small files input for MLlib, which will return an RDD[(fileName

Re: Mllib in pyspark for 0.8.1

2014-04-01 Thread Matei Zaharia
You could probably port it back, but it required some changes on the Java side as well (a new PythonMLUtils class). It might be easier to fix the Mesos issues with 0.9. Matei On Apr 1, 2014, at 8:53 AM, Ian Ferreira ianferre...@hotmail.com wrote: Hi there, For some reason the

Re: [shark-users] SQL on Spark - Shark or SparkSQL

2014-03-30 Thread Matei Zaharia
Hi Manoj, At the current time, for drop-in replacement of Hive, it will be best to stick with Shark. Over time, Shark will use the Spark SQL backend, but should remain deployable the way it is today (including launching the SharkServer, using the Hive CLI, etc). Spark SQL is better for

Re: Scala 2.10.4

2014-03-28 Thread Matei Zaharia
worth the annoyance of everyone needing to download a new version of Scala, making yet another version of the AMIs, etc. -Kay On Thu, Mar 27, 2014 at 4:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Sounds good. Feel free to send a PR even though it's a small change (it leads

Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this? Matei On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.

Re: Scala 2.10.4

2014-03-27 Thread Matei Zaharia
Sounds good. Feel free to send a PR even though it’s a small change (it leads to better Git history and such). Matei On Mar 27, 2014, at 4:15 PM, Mark Hamstra m...@clearstorydata.com wrote: FYI, Spark master does build cleanly and the tests do run successfully with Scala version set to

Re: pySpark memory usage

2014-03-27 Thread Matei Zaharia
exceptions, but I think they all stem from the above, eg. org.apache.spark.SparkException: Error sending message to BlockManagerMaster Let me know if there are other settings I should try, or if I should try a newer snapshot. Thanks again! On Mon, Mar 24, 2014 at 9:35 AM, Matei Zaharia

Re: Announcing Spark SQL

2014-03-26 Thread Matei Zaharia
Congrats Michael & co for putting this together — this is probably the neatest piece of technology added to Spark in the past few months, and it will greatly change what users can do as more data sources are added. Matei On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com

Re: All pairs shortest paths?

2014-03-26 Thread Matei Zaharia
wrote: Much thanks, I suspected this would be difficult. I was hoping to generate some 4 degrees of separation-like statistics. Looks like I'll just have to work with a subset of my graph. On Wed, Mar 26, 2014 at 5:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: All-pairs distances

Re: [VOTE] Release Apache Spark 0.9.1 (rc1)

2014-03-25 Thread Matei Zaharia
+1 looks good to me. I tried both the source and CDH4 versions and looked at the new streaming docs. The release notes seem slightly incomplete, but I guess you’re still working on them? Anyway those don’t go into the release tarball so it’s okay. Matei On Mar 24, 2014, at 2:01 PM, Tathagata

Re: [VOTE] Release Apache Spark 0.9.1 (rc1)

2014-03-25 Thread Matei Zaharia
it on a Linux cluster. I opened https://spark-project.atlassian.net/browse/SPARK-1326 to track it. We can put it in another RC if we find bigger issues. Matei On Mar 25, 2014, at 10:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: +1 looks good to me. I tried both the source and CDH4 versions

Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi

Re: new Catalyst/SQL component merged into master

2014-03-21 Thread Matei Zaharia
Congrats Michael and all for getting this so far. Spark SQL and Catalyst will make it much easier to use structured data in Spark, and open the door for some very cool extensions later. Matei On Mar 20, 2014, at 11:15 PM, Heiko Braun ike.br...@googlemail.com wrote: Congrats! That's a really

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
Try passing the shuffle=true parameter to coalesce, then it will do the map in parallel but still pass all the data through one reduce node for writing it out. That’s probably the fastest it will get. No need to cache if you do that. Matei On Mar 21, 2014, at 4:04 PM, Aureliano Buendia
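A rough illustration of why `shuffle=true` helps, with plain Python lists standing in for partitions (the function and data are made up): the map still runs per-partition, and only afterwards does a shuffle route everything to the single output partition that writes the file.

```python
def coalesce_to_one_with_shuffle(partitions, f):
    # Step 1: the map runs independently on each input partition
    # (in parallel across the cluster).
    mapped = [[f(x) for x in part] for part in partitions]
    # Step 2: a shuffle routes all mapped records to one output
    # partition, so a single task writes the single output file.
    return [x for part in mapped for x in part]

out = coalesce_to_one_with_shuffle([[1, 2], [3, 4]], lambda x: x * 10)
# out == [10, 20, 30, 40]
```

Without the shuffle, coalesce(1) would instead pull the *unmapped* data into one partition first, forcing the map itself to run serially on one node.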

Re: How to save as a single file efficiently?

2014-03-21 Thread Matei Zaharia
, at 5:01 PM, Aureliano Buendia buendia...@gmail.com wrote: Good to know it's as simple as that! I wonder why shuffle=true is not the default for coalesce(). On Fri, Mar 21, 2014 at 11:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try passing the shuffle=true parameter to coalesce

Re: Pyspark worker memory

2014-03-20 Thread Matei Zaharia
-Dspark.executor.memory in SPARK_JAVA_OPTS *on the master*. I'm not sure how this varies from 0.9.0 release, but it seems to work on SNAPSHOT. On Tue, Mar 18, 2014 at 11:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Try checking spark-env.sh on the workers as well. Maybe code there is somehow

Re: DStream spark paper

2014-03-20 Thread Matei Zaharia
Hi Adrian, On every timestep of execution, we receive new data, then report updated word counts for that new data plus the past 30 seconds. The latency here is about how quickly you get these updated counts once the new batch of data comes in. It’s true that the count reflects some data from
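The windowed count described above — report counts over the past 30 seconds, updated on every new batch — can be sketched with a deque of per-batch counts. A window of 3 batches stands in for the 30-second window (the batch interval is an assumption, not from the thread):

```python
from collections import Counter, deque

WINDOW_BATCHES = 3  # e.g. a 30 s window over 10 s batches (assumed sizes)
recent = deque(maxlen=WINDOW_BATCHES)

def update(batch_words):
    """On each timestep, fold in the new batch and report word counts
    over the last WINDOW_BATCHES batches; older batches slide out."""
    recent.append(Counter(batch_words))
    total = Counter()
    for c in recent:
        total += c
    return total

update(["a", "b"])
update(["a"])
update(["b", "b"])
counts = update(["c"])  # the first batch has now slid out of the window
# counts over the last 3 batches: a -> 1, b -> 2, c -> 1
```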

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-19 Thread Matei Zaharia
at 12:15 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro

Re: Pyspark worker memory

2014-03-19 Thread Matei Zaharia
Try checking spark-env.sh on the workers as well. Maybe code there is somehow overriding the spark.executor.memory setting. Matei On Mar 18, 2014, at 6:17 PM, Jim Blomo jim.bl...@gmail.com wrote: Hello, I'm using the Github snapshot of PySpark and having trouble setting the worker memory

Re: What's the lifecycle of an rdd? Can I control it?

2014-03-19 Thread Matei Zaharia
Yes, Spark automatically removes old RDDs from the cache when you make new ones. Unpersist forces it to remove them right away. In both cases though, note that Java doesn’t garbage-collect the objects released until later. Matei On Mar 19, 2014, at 7:22 PM, Nicholas Chammas

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2014-03-18 Thread Matei Zaharia
Hey Nick, I’m curious, have you been doing any further development on this? It would be good to get expanded InputFormat support in Spark 1.0. To start with we don’t have to do SequenceFiles in particular, we can do stuff like Avro (if it’s easy to read in Python) or some kind of

Re: Incrementally add/remove vertices in GraphX

2014-03-18 Thread Matei Zaharia
I just meant that you call union() before creating the RDDs that you pass to new Graph(). If you call it after it will produce other RDDs. The Graph() constructor actually shuffles and “indexes” the data to make graph operations efficient, so it’s not too easy to add elements after. You could

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
Hi Diana, Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopDataset to load data with any InputFormat that Hadoop MapReduce supports. In Python, you unfortunately only have textFile, which gives you one record per line.

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
to me how to do that as I probably should be. Thanks, Diana On Mon, Mar 17, 2014 at 1:02 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Diana, Non-text input formats are only supported in Java and Scala right now, where you can use sparkContext.hadoopFile or .hadoopDataset

Re: is collect exactly-once?

2014-03-17 Thread Matei Zaharia
Yup, it only returns each value once. Matei On Mar 17, 2014, at 1:14 PM, Adrian Mocanu amoc...@verticalscope.com wrote: Hi Quick question here, I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I’ve collect()-ed if spark node crashes I

Re: example of non-line oriented input data?

2014-03-17 Thread Matei Zaharia
) On Mon, Mar 17, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Here’s an example of getting together all lines in a file as one string: $ cat dir/a.txt Hello world! $ cat dir/b.txt What's up?? $ bin/pyspark files = sc.textFile(“dir”) files.collect() [u'Hello

Re: links for the old versions are broken

2014-03-17 Thread Matei Zaharia
Thanks for reporting this, looking into it. On Mar 17, 2014, at 2:44 PM, Walrus theCat walrusthe...@gmail.com wrote: ping On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson ilike...@gmail.com wrote: Looks like everything from 0.8.0 and before errors similarly (though Spark 0.3 for Scala

Re: possible bug in Spark's ALS implementation...

2014-03-16 Thread Matei Zaharia
On Mar 14, 2014, at 5:52 PM, Michael Allman m...@allman.ms wrote: I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is

Re: How to kill a spark app ?

2014-03-16 Thread Matei Zaharia
If it’s a driver on the cluster, please open a JIRA issue about this — this kill command is indeed intended to work. Matei On Mar 16, 2014, at 2:35 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Are you embedding your driver inside the cluster? If not then that command will not kill the

Re: [Powered by] Yandex Islands powered by Spark

2014-03-16 Thread Matei Zaharia
Thanks, I’ve added you: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark. Let me know if you want to change any wording. Matei On Mar 16, 2014, at 6:48 AM, Egor Pahomov pahomov.e...@gmail.com wrote: Hi, page https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Re: [DISCUSS] Necessity of Maven *and* SBT Build in Spark

2014-03-14 Thread Matei Zaharia
I like the pom-reader approach as well — in particular, that it lets you add extra stuff in your SBT build after loading the dependencies from the POM. Profiles would be the one thing missing to be able to pass options through. Matei On Mar 14, 2014, at 10:03 AM, Patrick Wendell

Re: NO SUCH METHOD EXCEPTION

2014-03-11 Thread Matei Zaharia
Since it’s from Scala, it might mean you’re running with a different version of Scala than you compiled Spark with. Spark 0.8 and earlier use Scala 2.9, while Spark 0.9 uses Scala 2.10. Matei On Mar 11, 2014, at 8:19 AM, Jeyaraj, Arockia R (Arockia) arockia.r.jeya...@verizon.com wrote: Hi,

Re: Powered By Spark Page -- Companies Organizations

2014-03-11 Thread Matei Zaharia
Thanks, added you. On Mar 11, 2014, at 2:47 AM, Christoph Böhm listenbru...@gmx.net wrote: Dear Spark team, thanks for the great work and congrats on becoming an Apache top-level project! You could add us to your Powered-by-page, because we are using Spark (and Shark) to perform

Re: RDD.saveAs...

2014-03-11 Thread Matei Zaharia
I agree that we can’t keep adding these to the core API, partly because it will get unwieldy to maintain and partly just because each storage system will bring in lots of dependencies. We can simply have helper classes in different modules for each storage system. There’s some discussion on

Re: major Spark performance problem

2014-03-09 Thread Matei Zaharia
Hi Dana, It’s hard to tell exactly what is consuming time, but I’d suggest starting by profiling the single application first. Three things to look at there: 1) How many stages and how many tasks per stage is Spark launching (in the application web UI at http://driver:4040)? If you have

Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
Xt*X should mathematically always be positive semi-definite, so the only way this might be bad is if it’s not invertible due to linearly dependent rows. This might happen due to the initialization or possibly due to numerical issues, though it seems unlikely. Maybe it also happens if some users
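The claim that Xt*X is always positive semi-definite follows in one line:

```latex
% For any real matrix X and any vector v:
v^\top (X^\top X)\, v \;=\; (Xv)^\top (Xv) \;=\; \|Xv\|^2 \;\ge\; 0
```

So X^T X can only fail to be positive *definite* (and hence invertible) when Xv = 0 for some v ≠ 0, i.e. when X is rank-deficient — which is exactly the linear-dependence case the thread describes.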

Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
definite. -- Sean Owen | Director, Data Science | London On Thu, Mar 6, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Xt*X should mathematically always be positive semi-definite, so the only way this might be bad is if it’s not invertible due to linearly dependent rows

Re: QR decomposition in Spark ALS

2014-03-06 Thread Matei Zaharia
, If the data has linearly dependent rows ALS should have a failback mechanism. Either remove the rows and then call BLAS posv or call BLAS gesv or Breeze QR decomposition. I can share the analysis over email. Thanks. Deb On Thu, Mar 6, 2014 at 9:39 AM, Matei Zaharia matei.zaha
