Re: QR decomposition in Spark ALS

2014-03-06 Thread Sean Owen
decomposition is pretty good here, yes. -- Sean Owen | Director, Data Science | London On Thu, Mar 6, 2014 at 3:05 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Sebastian, Yes, Mahout ALS and Oryx run fine on the same matrix because Sean calls QR decomposition. But the ALS objective should

Re: QR decomposition in Spark ALS

2014-03-06 Thread Sean Owen
-revealing and can help detect this situation. That is why I had just used the QR decomposition. I agree that it is almost surely a synthetic or flawed setup that causes this situation, but, those things do happen. -- Sean Owen | Director, Data Science | London On Thu, Mar 6, 2014 at 6:51 PM, Matei
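For context, each ALS half-step solves a regularized least-squares problem; a standard way QR enters (a textbook formulation, not quoted from the Spark source) is:

```latex
\min_x \|Ax - b\|_2^2
\quad\Rightarrow\quad
A = QR \ (Q^\top Q = I,\ R \text{ upper triangular})
\quad\Rightarrow\quad
Rx = Q^\top b
```

A rank-revealing variant surfaces near-zero diagonal entries of R, which is how a rank-deficient A can be detected rather than silently solved.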

Re: cloudera repo down again - mqtt

2014-03-14 Thread Sean Owen
in the parent pom, and ahead of the Cloudera repo. This causes it to be tried first, which is appropriate. Any +1 for either of those changes? -- Sean Owen | Director, Data Science | London On Fri, Mar 14, 2014 at 7:37 AM, Tom Graves tgraves...@yahoo.com wrote: It appears the cloudera repo for the mqtt

Re: cloudera repo down again - mqtt

2014-03-14 Thread Sean Owen
their order can be controlled as desired. Child pom repos come after parent repos and that, while it rarely makes any difference, isn't actually desirable. I'll prep a PR but wait for someone else to second a change like that. -- Sean Owen | Director, Data Science | London On Fri, Mar 14, 2014 at 7:57

Re: cloudera repo down again - mqtt

2014-03-14 Thread Sean Owen
PS the Cloudera cert issue was cleared up a few hours ago; give it a spin. On Fri, Mar 14, 2014 at 8:22 AM, Sean Owen so...@cloudera.com wrote: Yes, I'm using Maven 3.2.1. Actually, scratch that, it fails for me too once it gets down into the MQTT module, with a clearer error

Re: ALS memory limits

2014-03-26 Thread Sean Owen
Much of this sounds related to the memory issue mentioned earlier in this thread. Are you using a build that has fixed that? That would be by far most important here. If the raw memory requirement is 8GB, the actual heap size necessary could be a lot larger -- object overhead, all the other stuff

Re: Master compilation

2014-04-05 Thread Sean Owen
. -- Sean Owen | Director, Data Science | London On Sat, Apr 5, 2014 at 11:06 PM, Patrick Wendell pwend...@gmail.com wrote: If you want to submit a hot fix for this issue specifically please do. I'm not sure why it didn't fail our build... On Sat, Apr 5, 2014 at 2:30 PM, Debasish Das debasish.da

Re: Master compilation

2014-04-06 Thread Sean Owen
scala.None certainly isn't new in 2.10.4; it's ancient : http://www.scala-lang.org/api/2.10.3/index.html#scala.None$ Surely this is some other problem? On Sun, Apr 6, 2014 at 6:46 PM, Koert Kuipers ko...@tresata.com wrote: also, i thought scala 2.10 was binary compatible, but does not seem to

Re: Tests failed after assembling the latest code from github

2014-04-15 Thread Sean Owen
Good call -- indeed that same Files class has a move() method that will try to use renameTo() and then fall back to copy() and delete() if needed for this very reason. On Tue, Apr 15, 2014 at 6:34 AM, Ye Xianjin advance...@gmail.com wrote: Hi, I think I have found the cause of the tests
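A minimal sketch of the fallback pattern described above, assuming Guava on the classpath; Guava's Files.move does the equivalent internally:

```scala
import java.io.File
import com.google.common.io.Files

// Try the fast rename first; fall back to copy + delete when renameTo
// fails (e.g. when source and destination are on different filesystems).
def moveWithFallback(src: File, dest: File): Unit = {
  if (!src.renameTo(dest)) {
    Files.copy(src, dest)
    if (!src.delete()) {
      throw new java.io.IOException(s"Could not delete $src")
    }
  }
}
```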

Re: Any plans for new clustering algorithms?

2014-04-21 Thread Sean Owen
On Mon, Apr 21, 2014 at 6:03 PM, Paul Brown p...@mult.ifario.us wrote: - MLlib as Mahout.next would be unfortunate. There are some gems in Mahout, but there are also lots of rocks. Setting a minimal bar of working, correctly implemented, and documented requires a surprising amount of work.

Re: [jira] [Created] (SPARK-1698) Improve spark integration

2014-05-02 Thread Sean Owen
#1 and #2 are not relevant to the issue of jar size. These can be problems in general, but I don't think there have been issues attributable to file clashes. Shading has mechanisms to deal with this anyway. #3 is a problem in general too, but is not specific to shading. Where versions collide, build

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-14 Thread Sean Owen
On Tue, May 13, 2014 at 2:49 PM, Sean Owen so...@cloudera.com wrote: On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell pwend...@gmail.com wrote: The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~pwendell/spark-1.0.0-rc5/ Good news

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Sean Owen
On this note, non-binding commentary: Releases happen in local minima of change, usually created by internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-17 Thread Sean Owen
On Sat, May 17, 2014 at 4:52 PM, Mark Hamstra m...@clearstorydata.com wrote: Which of the unresolved bugs in spark-core do you think will require an API-breaking change to fix? If there are none of those, then we are still essentially on track for a 1.0.0 release. I don't have a particular

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sean Owen
I might be stating the obvious for everyone, but the issue here is not reflection or the source of the JAR, but the ClassLoader. The basic rules are this. new Foo will use the ClassLoader that defines Foo. This is usually the ClassLoader that loaded whatever it is that first referenced Foo and
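A minimal sketch of the ClassLoader rule at work, with com.example.Foo standing in for a hypothetical class shipped via sc.addJar; such a class is visible to the context ClassLoader but not necessarily to the loader that defined the calling class, so `new Foo` fails while reflection succeeds:

```scala
// Resolve the class through the thread's context ClassLoader instead of
// relying on the loader that defined the current class.
val clazz = Class.forName(
  "com.example.Foo",  // hypothetical class added via sc.addJar
  true,               // initialize the class
  Thread.currentThread().getContextClassLoader)
val instance = clazz.getDeclaredConstructor().newInstance()
```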

Re: Calling external classes added by sc.addJar needs to be through reflection

2014-05-19 Thread Sean Owen
Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, May 18, 2014 at 11:57 PM, Sean Owen so...@cloudera.com wrote: I might be stating the obvious for everyone, but the issue here is not reflection or the source of the JAR, but the ClassLoader. The basic rules

Re: Sorting partitions in Java

2014-05-20 Thread Sean Owen
It's an Iterator in both Java and Scala. In both cases you need to copy the stream of values into something List-like to sort it. An Iterable would not change that (not sure the API can promise many iterations anyway). If you just want the equivalent of toArray, you can use a utility method in
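A sketch of the copy-then-sort approach, assuming each partition fits in memory:

```scala
import org.apache.spark.rdd.RDD

// Materialize each partition's Iterator into an Array, sort it, and hand
// back a fresh Iterator over the sorted values.
def sortEachPartition(rdd: RDD[Int]): RDD[Int] =
  rdd.mapPartitions(it => it.toArray.sorted.iterator)
```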

Re: BUG: graph.triplets does not return proper values

2014-05-20 Thread Sean Owen
http://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions It becomes automagically available when your RDD contains pairs. On Tue, May 20, 2014 at 9:00 PM, GlennStrycker glenn.stryc...@gmail.com wrote: I don't seem to have this function in my Spark
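A minimal sketch of the implicit conversion in action (the explicit import was required in 0.9.x; later versions make it automatic):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // rddToPairRDDFunctions, among others

def demo(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  val sums = pairs.reduceByKey(_ + _)  // a PairRDDFunctions method
  sums.collect().foreach(println)
}
```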

Kafka + Spark Streaming and NoSuchMethodError, related to Manifest / reflection?

2014-05-27 Thread Sean Owen
I'd like to resurrect this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c6d657d19-1ecf-4e92-bf15-cc4762ef9...@thekratos.com%3E Basically when you call this particular Java-flavored overloading of KafkaUtils.createStream:

Re: FYI -- javax.servlet dependency issue workaround

2014-05-28 Thread Sean Owen
This class was introduced in Servlet 3.0. We have in the dependency tree some references to Servlet 2.5 and Servlet 3.0. The latter is a superset of the former. So we standardized on depending on Servlet 3.0. At least, that seems to have been successful in the Maven build, but this is just

Re: Streaming example stops outputting (Java, Kafka at least)

2014-05-30 Thread Sean Owen
Thanks Nan, that does appear to fix it. I was using local. Can anyone say whether that's to be expected or whether it could be a bug somewhere? On Fri, May 30, 2014 at 2:42 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Sean I had the same problem, but when I changed MASTER="local" to

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-06-02 Thread Sean Owen
On Mon, Jun 2, 2014 at 6:05 PM, Marcelo Vanzin van...@cloudera.com wrote: You mentioned something in your shading argument that kinda reminded me of something. Spark currently depends on slf4j implementations and log4j with compile scope. I'd argue that's the wrong approach if we're talking

Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0-1.0.0

2014-06-08 Thread Sean Owen
I suspect Patrick is right about the cause. The Maven artifact that was released does contain this class (phew) http://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-core_2.10%7C1.0.0%7Cjar As to the hadoop1 / hadoop2 artifact question -- agree that is often done. Here the working

Re: implementing the VectorAccumulatorParam

2014-06-09 Thread Sean Owen
(The user@ list might be a bit better but I can see why it might look like a dev@ question.) Did you import org.apache.spark.mllib.linalg.Vector ? I think you are picking up Scala's Vector class instead. On Mon, Jun 9, 2014 at 11:57 AM, dataginjaninja rickett.stepha...@gmail.com wrote: The

Re: implementing the VectorAccumulatorParam

2014-06-09 Thread Sean Owen
(BCC dev@) The example is out of date with respect to current Vector class. The zeros() method is on Vectors. There is not currently a += operation for Vector anymore. To be fair the example doesn't claim this illustrates use of the Spark Vector class but it did work with the now-deprecated
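A sketch of what the updated example might look like against the 1.x API, with addition done on the underlying arrays since Vector no longer has +=:

```scala
import org.apache.spark.AccumulatorParam
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  // The zero vector matching the shape of the initial value.
  def zero(initialValue: Vector): Vector =
    Vectors.zeros(initialValue.size)
  // Element-wise addition via the backing arrays.
  def addInPlace(v1: Vector, v2: Vector): Vector =
    Vectors.dense(v1.toArray.zip(v2.toArray).map { case (a, b) => a + b })
}
```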

Re: Could the function MLUtils.loadLibSVMFile be modified to support zero-based-index data?

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 7:29 AM, Lizhengbing (bing, BIPA) zhengbing...@huawei.com wrote: 1) I download the imdb data from http://komarix.org/ac/ds/Blanc__Mel.txt.bz2 and use this data to test LBFGS 2) I find the imdb data are zero-based-index data Since the method is for parsing the

Re: Catalyst dependency on Spark Core

2014-07-15 Thread Sean Owen
Agree. You end up with a core and a corer core to distinguish between and it ends up just being more complicated. This sounds like something that doesn't need a module. On Tue, Jul 15, 2014 at 5:59 AM, Patrick Wendell pwend...@gmail.com wrote: Adding new build modules is pretty high overhead, so

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Are you setting -Pyarn-alpha? ./sbt/sbt -Pyarn-alpha, followed by projects, shows it as a module. You should only build yarn-stable *or* yarn-alpha at any given time. I don't remember the modules changing in a while. 'yarn-alpha' is for YARN before it stabilized, circa early Hadoop 2.0.x.

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
This looks like a Jetty version problem actually. Are you bringing in something that might be changing the version of Jetty used by Spark? It depends a lot on how you are building things. Good to specify exactly how you're building here. On Thu, Jul 17, 2014 at 3:43 PM, Nathan Kronenfeld

Re: Possible bug in ClientBase.scala?

2014-07-17 Thread Sean Owen
Looks like a real problem. I see it too. I think the same workaround found in ClientBase.scala needs to be used here. There, the fact that this field can be a String or String[] is handled explicitly. In fact I think you can just call to ClientBase for this? PR it, I say. On Thu, Jul 17, 2014 at

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
, 2014 at 10:56 AM, Sean Owen so...@cloudera.com wrote: This looks like a Jetty version problem actually. Are you bringing in something that might be changing the version of Jetty used by Spark? It depends a lot on how you are building things. Good to specify exactly how you're building here

Re: Compile error when compiling for cloudera

2014-07-17 Thread Sean Owen
can make this change after 1642 is through. On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen so...@cloudera.com wrote: CC tmalaska since he touched the line in question. This is a fun one. So, here's the line of code added last week: val channelFactory = new NioServerSocketChannelFactory

Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

2014-07-27 Thread Sean Owen
Good idea, although it gets difficult in the context of multiple distributions. Say change X is not present in version A, but present in version B. If you depend on X, what version can you look for to detect it? The distribution will return A or A+X or somesuch, but testing for A will give an

Re: Working Formula for Hive 0.13?

2014-07-28 Thread Sean Owen
%7Corg.apache.hive%7Chive-exec%7C0.13.1%7Cjar Should a JIRA be opened so that dependency on hive-metastore can be replaced by dependency on hive-exec ? Cheers On Mon, Jul 28, 2014 at 8:26 AM, Sean Owen so...@cloudera.com wrote: The reason for org.spark-project.hive is that Spark relies

Re: Intellij IDEA can not recognize the MLlib package

2014-08-03 Thread Sean Owen
You missed the mllib artifact? That would certainly explain it! All I see is core. On Sun, Aug 3, 2014 at 10:03 AM, jun kit...@126.com wrote: Hi, I have started my spark exploration in IntelliJ IDEA local mode and want to focus on the MLlib part. But when I put some example code in IDEA, it

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
For any Hadoop 2.4 distro, yes, set hadoop.version but also set -Phadoop-2.4. http://spark.apache.org/docs/latest/building-with-maven.html On Mon, Aug 4, 2014 at 9:15 AM, Patrick Wendell pwend...@gmail.com wrote: For hortonworks, I believe it should work to just link against the corresponding

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Sean Owen
What would such a profile do though? In general building for a specific vendor version means setting hadoop.version and/or yarn.version. Any hard-coded value is unlikely to match what a particular user needs. Setting protobuf versions and so on is already done by the generic profiles. In a

Re: Buidling spark in Eclipse Kepler

2014-08-06 Thread Sean Owen
I think your best bet by far is to consume the Maven build as-is from within Eclipse. I wouldn't try to export a project config from the build as there is plenty to get lost in translation. Certainly this works well with IntelliJ, and by the by, if you have a choice, I would strongly recommend

Re: Buidling spark in Eclipse Kepler

2014-08-07 Thread Sean Owen
(Don't use gen-idea, just open it directly as a Maven project in IntelliJ.) On Thu, Aug 7, 2014 at 4:53 AM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: So I downloaded community edition of IntelliJ, and ran sbt/sbt gen-idea. I then imported the pom.xml file. I'm still getting all sorts of

Re: Documentation confusing or incorrect for decision trees?

2014-08-07 Thread Sean Owen
It's definitely just a typo. The ordered categories are A, C, B so the other split can't be A | B, C. Just open a PR. On Thu, Aug 7, 2014 at 2:11 AM, Matt Forbes m...@tellapart.com wrote: I found the section on ordering categorical features really interesting, but the A, B, C example seemed

Re: Unit tests in 5 minutes

2014-08-08 Thread Sean Owen
A common approach is to separate unit tests from integration tests. Maven has support for this distinction. I'm not sure it helps a lot though, since it only helps you to not run integration tests all the time. But lots of Spark tests are integration-test-like and are important to run to know a

Re: More productive compilation of spark code base

2014-08-11 Thread Sean Owen
Try setting it to handle incremental compilation of Scala by itself (IntelliJ) and to run its own compile server. This is in global settings, under the Scala settings. It seems to compile incrementally for me when I change a file or two. On Mon, Aug 11, 2014 at 8:57 PM, Ron's Yahoo!

Re: is Branch-1.1 SBT build broken for yarn-alpha ?

2014-08-21 Thread Sean Owen
Maven is just telling you that there is no version 1.1.0 of yarn-parent, and indeed, it has not been released. To build the branch you would need to mvn install to compile and make available local copies of artifacts along the way. (You may have these for 1.1.0-SNAPSHOT locally already). Use

Re: reference to dstream in package org.apache.spark.streaming which is not available

2014-08-22 Thread Sean Owen
Yes, master hasn't compiled for me for a few days. It's fixed in: https://github.com/apache/spark/pull/1726 https://github.com/apache/spark/pull/2075 Could a committer sort this out? Sean On Fri, Aug 22, 2014 at 9:55 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, Using the following command on

Re: Problems running examples in IDEA

2014-08-24 Thread Sean Owen
The examples aren't runnable quite like this. It's intended that they are submitted to a cluster with spark-submit, which would among other things provide Spark at runtime. I think you might get them to run this way if you set master to local[*] and indeed made a run profile that also included

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Sean Owen
+1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail any more than usual. FWIW I've also been using the 1.1.0-SNAPSHOT for some time in another project and have encountered no problems. I notice that the 1.1.0

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Sean Owen
popular Hadoop versions to lower the bar for users to build and test Spark. - Patrick On Thu, Aug 28, 2014 at 11:04 PM, Sean Owen so...@cloudera.com wrote: +1 I tested the source and Hadoop 2.4 release. Checksums and signatures are OK. Compiles fine with Java 8 on OS X. Tests... don't fail

Re: [VOTE] Release Apache Spark 1.1.0 (RC2)

2014-08-29 Thread Sean Owen
On Fri, Aug 29, 2014 at 7:42 AM, Patrick Wendell pwend...@gmail.com wrote: In terms of vendor support for this approach - In the early days Cloudera asked us to add CDH4 repository and more recently Pivotal and MapR also asked us to allow linking against their hadoop-client libraries. So we've

Re: [SPARK-3324] make yarn module as a unified maven jar project

2014-08-31 Thread Sean Owen
This isn't possible since the two versions of YARN are mutually incompatible at compile-time. However see my comments about how this could be restructured to be a little more standard, and so that IntelliJ would parse it out of the box. Still I imagine it is not worth it if YARN alpha will go

Re: [SPARK-3324] make yarn module as a unified maven jar project

2014-08-31 Thread Sean Owen
</sources> </configuration> </execution> </executions> </plugin> On Aug 31, 2014, at 16:19, Sean Owen so...@cloudera.com wrote: This isn't possible since the two versions of YARN are mutually incompatible at compile-time. However see my comments

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread Sean Owen
All the signatures are correct. The licensing all looks fine. The source builds fine. Now, let me ask about unit tests, since I had a more detailed look, which I should have done before. dev/run-tests fails two tests (1 Hive, 1 Kafka Streaming) for me locally on 1.1.0-rc3. Does anyone else see

Re: [VOTE] Release Apache Spark 1.1.0 (RC3)

2014-08-31 Thread Sean Owen
Fantastic. As it happens, I just fixed up Mahout's tests for Java 8 and observed a lot of the same type of failure. I'm about to submit PRs for the two issues I identified. AFAICT these 3 then cover the failures I mentioned: https://issues.apache.org/jira/browse/SPARK-3329

Re: about spark assembly jar

2014-09-02 Thread Sean Owen
Hm, are you suggesting that the Spark distribution be a bag of 100 JARs? It doesn't quite seem reasonable. It does not remove version conflicts, just pushes them to run-time, which isn't good. The assembly is also necessary because that's where shading happens. In development, you want to run

Re: [VOTE] Release Apache Spark 1.1.0 (RC4)

2014-09-03 Thread Sean Owen
+1 signatures still fine, tests still pass. On Mac OS X I get the following failure but I think it's spurious. Only mentioning it to see if anyone else sees it. It doesn't happen on Linux. [error] Test org.apache.spark.streaming.kafka.JavaKafkaStreamSuite.testKafkaStream failed:

Re: Dependency hell in Spark applications

2014-09-04 Thread Sean Owen
Dumb question -- are you using a Spark build that includes the Kinesis dependency? that build would have resolved conflicts like this for you. Your app would need to use the same version of the Kinesis client SDK, ideally. All of these ideas are well-known, yes. In cases of super-common

Re: trimming unnecessary test output

2014-09-06 Thread Sean Owen
This is just a line logging that one test succeeded right? I don't find that noise. Recently I wanted to search test run logs for a test case success and it was important that the individual test case was logged. On Sep 6, 2014 4:13 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: jenkins failed all tests?

2014-09-07 Thread Sean Owen
It would help to point to your change. Are you sure it was only docs and are you sure you're rebased, submitting against the right branch? Jenkins is saying you are changing public APIs; it's not reporting test failures. But it could well be a test/Jenkins problem. On Sun, Sep 7, 2014 at 8:39 PM,

Re: RFC: Deprecating YARN-alpha API's

2014-09-09 Thread Sean Owen
FWIW consensus from Cloudera folk seems to be that there's no need or demand on this end for YARN alpha. It wouldn't have an impact if it were removed sooner even. It will be a small positive to reduce complexity by removing this support, making it a little easier to develop for current YARN

Re: Questions regarding memory usage

2014-09-12 Thread Sean Owen
On Thu, Sep 11, 2014 at 10:17 PM, Tom thubregt...@gmail.com wrote: If I set SPARK_DRIVER_MEMORY to x GB, Spark reports: 14/09/11 15:36:41 INFO MemoryStore: MemoryStore started with capacity ~0.55*x GB. Question: Does this relate to spark.storage.memoryFraction (default 0.6), and is the
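The ~0.55x figure is consistent with the 1.x defaults, assuming spark.storage.safetyFraction is also at its default of 0.9:

```latex
\text{capacity} \approx x \times \underbrace{0.6}_{\text{memoryFraction}} \times \underbrace{0.9}_{\text{safetyFraction}} = 0.54\,x \ \text{GB}
```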

Re: Local tests logging to log4j

2014-10-07 Thread Sean Owen
What has worked for me is to bundle log4j.properties in the root of the application's .jar file, since log4j will look for it there, and configuring log4j will turn off Spark's default log4j configuration. I don't think conf/log4j.properties is going to do anything by itself, but
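As an illustration only (logger names and levels here are hypothetical, not prescribed by Spark), a minimal log4j.properties that could sit at the jar root:

```properties
# Send everything to the console; quiet Spark's own logging a bit.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
log4j.logger.org.apache.spark=WARN
```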

Decision forests don't work with non-trivial categorical features

2014-10-12 Thread Sean Owen
I'm having trouble getting decision forests to work with categorical features. I have a dataset with a categorical feature with 40 values. It seems to be treated as a continuous/numeric value by the implementation. Digging deeper, I see there is some logic in the code that indicates that

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
heuristic when storing histogram bins (and searching for optimal splits) in the tree code. Maybe Manish or Joseph can clarify? On Oct 12, 2014, at 2:50 PM, Sean Owen so...@cloudera.com wrote: I'm having trouble getting decision forests to work with categorical features. I have a dataset

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
, Sean Owen so...@cloudera.com wrote: I'm looking at this bit of code in DecisionTreeMetadata ... val maxCategoriesForUnorderedFeature = ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories

Re: Decision forests don't work with non-trivial categorical features

2014-10-13 Thread Sean Owen
Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send some details directly under separate cover. On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley jos...@databricks.com wrote: Hi Sean, Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) Short version: That

Re: Issues with ALS positive definite

2014-10-16 Thread Sean Owen
The Gramian is at least positive semidefinite, and will be positive definite if the matrix is non-singular, yes. That's usually but not always true. The lambda*I matrix is positive definite, well, when lambda is positive. Adding that makes it definite. At least, lambda=0 could be rejected as invalid. But
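In symbols, the standard argument being made:

```latex
x^\top (A^\top A)\,x = \|Ax\|_2^2 \ge 0
\qquad\text{and, for } \lambda > 0,\ x \ne 0:\qquad
x^\top (A^\top A + \lambda I)\,x = \|Ax\|_2^2 + \lambda\|x\|_2^2 > 0
```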

Re: Building and Running Spark on OS X

2014-10-20 Thread Sean Owen
Maven is at least built in to OS X (well, with dev tools). You don't even have to brew install it. Surely SBT isn't in the dev tools even? I recall I had to install it. I'd be surprised to hear it required zero setup. On Mon, Oct 20, 2014 at 8:04 PM, Nicholas Chammas nicholas.cham...@gmail.com

Re: Building and Running Spark on OS X

2014-10-20 Thread Sean Owen
Oh right, we're talking about the bundled sbt of course. And I didn't know Maven wasn't installed anymore! On Mon, Oct 20, 2014 at 8:20 PM, Hari Shreedharan hshreedha...@cloudera.com wrote: The sbt executable that is in the spark repo can be used to build sbt without any other set up (it will

Easy win: SBT plugin config expert to help on SPARK-3359?

2014-10-21 Thread Sean Owen
This one can be resolved, I think, with a bit of help from someone who understands SBT + plugin config: https://issues.apache.org/jira/browse/SPARK-3359 Just a matter of figuring out how to set a property on the plugin. This would make Java 8 javadoc work much more nicely. Minor but useful!

Re: something wrong with Jenkins or something untested merged?

2014-10-21 Thread Sean Owen
Given the nature of the error, I would be really, really shocked if Java 7u71 were actually being used in the failing build, so no I do not think the problem has to do with 7u71 per se. As I'd expect I see no changes to javac in this update from 7u65, and no chatter about crazy javac regressions.

Re: scalastyle annoys me a little bit

2014-10-24 Thread Sean Owen
On Fri, Oct 24, 2014 at 8:59 PM, Koert Kuipers ko...@tresata.com wrote: mvn clean package -DskipTests takes about 30 mins for me. That's painful since it's needed for the tests. Does anyone know any tricks to speed it up? (besides getting a better laptop). Does zinc help? Zinc helps by about

Re: Moving PR Builder to mvn

2014-10-24 Thread Sean Owen
Here's a crude benchmark on a Linux box (GCE n1-standard-4). zinc gets the assembly build in range of SBT's time.
mvn -DskipTests clean package: 15:27 | (start zinc) 8:18 | (rebuild) 7:08
./sbt/sbt -DskipTests clean assembly: 5:10 | (start zinc) 5:11 | (rebuild) 5:06
The dependencies were already

Re: How to run tests properly?

2014-10-28 Thread Sean Owen
On Tue, Oct 28, 2014 at 6:18 PM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: 1. via dev/run-tests script This script executes all tests and takes several hours to finish. Some tests failed but I can't say which of them. Should this really take that long? Can I specify to run only

Re: How to run tests properly?

2014-10-29 Thread Sean Owen
On Wed, Oct 29, 2014 at 6:02 PM, Niklas Wilcke 1wil...@informatik.uni-hamburg.de wrote: The core tests seem to fail because of my German locale. Some tests are locale-dependent, like the UtilsSuite.scala - string formatting of time durations - which checks for locale-dependent separators like .

Re: How to run tests properly?

2014-10-30 Thread Sean Owen
for the failure. I tried some different configurations like [1,1,512], [2,1,1024] etc. but couldn't get the tests to run without a failure. Could this be a configuration issue? On 28.10.2014 19:03, Sean Owen wrote: On Tue, Oct 28, 2014 at 6:18 PM, Niklas Wilcke 1wil...@informatik.uni

Re: matrix factorization cross validation

2014-10-30 Thread Sean Owen
MAP is effectively an average over all k from 1 to min(# recommendations, # items rated) Getting first recommendations right is more important than the last. On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das debasish.da...@gmail.com wrote: Does it make sense to have a user specific K or K is
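One common formulation of what is being described, where P_u(i) is the precision at cutoff i, rel_u(i) indicates that recommendation i is relevant to user u, and R_u is u's set of rated items:

```latex
\mathrm{AP}@k(u) = \frac{1}{\min(k,\,|R_u|)} \sum_{i=1}^{k} P_u(i)\,\mathrm{rel}_u(i),
\qquad
\mathrm{MAP} = \frac{1}{|U|} \sum_{u \in U} \mathrm{AP}@k(u)
```

Because precision at early cutoffs dominates the sum, errors in the first recommendations cost more than errors in the last.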

OOM when making bins in BinaryClassificationMetrics ?

2014-11-02 Thread Sean Owen
This might be a question for Xiangrui. Recently I was using BinaryClassificationMetrics to build an AUC curve for a classifier over a reasonably large number of points (~12M). The scores were all probabilities, so tended to be almost entirely unique. The computation does some operations by key,

Re: OOM when making bins in BinaryClassificationMetrics ?

2014-11-02 Thread Sean Owen
/Partitioner.scala#L104 . Limiting the number of bins is definitely useful. Do you have time to work on it? -Xiangrui On Sun, Nov 2, 2014 at 9:34 AM, Sean Owen so...@cloudera.com wrote: This might be a question for Xiangrui. Recently I was using BinaryClassificationMetrics to build an AUC curve
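A sketch of the bin-limiting approach under discussion; later Spark releases expose it as a constructor argument (assuming a version that has it):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

def aucWithBinning(scoreAndLabels: RDD[(Double, Double)]): Double = {
  // Down-sample millions of mostly-unique scores into at most 1000 curve
  // points instead of one point per distinct score.
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 1000)
  metrics.areaUnderROC()
}
```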

Re: Hadoop configuration for checkpointing

2014-11-04 Thread Sean Owen
Let me crash this thread to suggest this *might* be related to this problem I'm trying to solve: https://issues.apache.org/jira/browse/SPARK-4196 Basically the question there is: this blank Configuration object gets made on the driver in the saveAsNewAPIHadoopFiles call, and seems to need to be

Re: Issues with AbstractParams

2014-11-04 Thread Sean Owen
I don't think it's anything to do with AbstractParams. The problem is MovieLensALS$Params, which is a case class without a default constructor. It is not Serializable. However you can see it gets used in an RDD function: val ratings = sc.textFile(params.input).map { line => val fields =
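A sketch of the usual fix: copy the needed field into a local val so the closure captures a plain string rather than the non-serializable Params instance (`params` and `sc` as in the MovieLensALS example):

```scala
val input = params.input  // local val; the closure no longer drags in `params`
val ratings = sc.textFile(input).map { line =>
  val fields = line.split("::")
  fields  // ... parse the fields into a Rating here ...
}
```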

JIRA + PR backlog

2014-11-06 Thread Sean Owen
(Different topic, indulge me one more reply --) Yes the number of JIRAs/PRs closed is unprecedented too and that deserves big praise. The project has stuck to making all changes and discussion in this public process, which is so powerful. Adjusted for the sheer inbound volume, Spark is doing a

Should new YARN shuffle service work with yarn-alpha?

2014-11-07 Thread Sean Owen
I noticed that this doesn't compile: mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package [error] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [error]

Re: Should new YARN shuffle service work with yarn-alpha?

2014-11-07 Thread Sean Owen
on isolating its inclusion to only the newer YARN APIs. - Patrick On Fri, Nov 7, 2014 at 11:43 PM, Sean Owen so...@cloudera.com wrote: I noticed that this doesn't compile: mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package [error] warning: [options] bootstrap class

Re: Should new YARN shuffle service work with yarn-alpha?

2014-11-08 Thread Sean Owen
Oops, that was my mistake. I moved network/shuffle into yarn, when it's just that network/yarn should be removed from yarn-alpha. That makes yarn-alpha work. I'll run tests and open a quick JIRA / PR for the change. On Sat, Nov 8, 2014 at 8:23 AM, Patrick Wendell pwend...@gmail.com wrote: This

Re: [NOTICE] [BUILD] Minor changes to Spark's build

2014-11-12 Thread Sean Owen
- Tip: when you rebase, IntelliJ will temporarily think things like the Kafka module are being removed. Say 'no' when it asks if you want to remove them. - Can we go straight to Scala 2.11.4? On Wed, Nov 12, 2014 at 5:47 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, I've just merged

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
LICENSE and NOTICE are fine. Signature and checksum are fine. I unzipped and built the plain source distribution, which built successfully. However I am seeing a consistent test failure with mvn -DskipTests clean package; mvn test. In the Hive module: - SET commands semantics for a HiveContext *** FAILED ***

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-13 Thread Sean Owen
2014-11-13 11:26 GMT-08:00 Michael Armbrust mich...@databricks.com: Hey Sean, Thanks for pointing this out. Looks like a bad test where we should be doing Set comparison instead of Array. Michael On Thu, Nov 13, 2014 at 2:05 AM, Sean Owen so...@cloudera.com wrote: LICENSE and NOTICE

Re: Spark Hadoop 2.5.1

2014-11-14 Thread Sean Owen
I don't think it's necessary. You're looking at the hadoop-2.4 profile, which works with anything >= 2.4. AFAIK there is no further specialization needed beyond that. The profile sets hadoop.version to 2.4.0 by default, but this can be overridden. On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet

Re: Spark Hadoop 2.5.1

2014-11-14 Thread Sean Owen
14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote: I don't think it's necessary. You're looking at the hadoop-2.4 profile, which works with anything >= 2.4. AFAIK there is no further specialization needed beyond that. The profile sets hadoop.version to 2.4.0 by default, but this can

Re: Has anyone else observed this build break?

2014-11-15 Thread Sean Owen
FWIW I do not see this on master with mvn -DskipTests clean package. I'm on OS X 10.10 and I build with Java 8 by default. On Fri, Nov 14, 2014 at 8:17 PM, Patrick Wendell pwend...@gmail.com wrote: A recent patch broke clean builds for me, I am trying to see how widespread this issue is and

Re: mvn or sbt for studying and developing Spark?

2014-11-15 Thread Sean Owen
No, the Maven build is the main one. I would use it unless you have a need to use the SBT build in particular. On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote: Hi Yiming, I believe that both SBT and MVN are supported in Spark, but SBT is preferred (I'm not 100%

If first batch fails, does Streaming JobGenerator.stop() hang?

2014-11-16 Thread Sean Owen
I thought I'd ask first since there's a good chance this isn't a problem, but, I'm having a problem wherein the first batch that Spark Streaming processes fails (due to an app problem), but then, stop() blocks a very long time. This bit of JobGenerator.stop() executes, since the message appears

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Sean Owen
]) Michael On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote: Hi Stephen and Sean, Thanks for correction. On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote: No, the Maven build is the main one. I would use it unless you have a need

Re: Using sampleByKey

2014-11-18 Thread Sean Owen
I use randomSplit to make a train/CV/test set in one go. It definitely produces disjoint data sets and is efficient. The problem is you can't do it by key. I am not sure why your subtract does not work. I suspect it is because the values do not partition the same way, or they don't evaluate
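A minimal sketch of the one-pass split described here, with `data` standing in for any RDD:

```scala
// Weights are normalized to sum to 1; a fixed seed makes the disjoint
// train/CV/test split reproducible across runs.
val Array(train, cv, test) = data.randomSplit(Array(0.8, 0.1, 0.1), seed = 42L)
```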

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-20 Thread Sean Owen
+1 (non-binding) Signatures and license look good. I built the plain-vanilla distribution and ran tests. While I still see the Java 8 + Hive test failure, I think we've established this is ignorable. On Wed, Nov 19, 2014 at 11:51 PM, Andrew Or and...@databricks.com wrote: I will start with a

Re: How to resolve Spark site issues?

2014-11-25 Thread Sean Owen
For the interested, the SVN repo for the site is viewable at http://svn.apache.org/viewvc/spark/site/ and to check it out, you can svn co https://svn.apache.org/repos/asf/spark/site. I assume the best process is to make a diff and attach it to the JIRA. How old school. On Tue, Nov 25, 2014 at

Re: Required file not found in building

2014-12-01 Thread Sean Owen
I'm having no problems with the build or zinc on my Mac. I use zinc from brew install zinc. On Tue, Dec 2, 2014 at 3:02 AM, Stephen Boesch java...@gmail.com wrote: Mac as well. Just found the problem: I had created an alias to zinc a couple of months back. Apparently that is not happy with

Re: Can the Scala classes in the spark source code, be inherited in Java classes?

2014-12-01 Thread Sean Owen
Yes, they are compiled to classes in JVM bytecode just the same. You may find the generated code from Scala looks a bit strange and uses Scala-specific classes, but it's certainly possible to treat them like other Java classes. On Tue, Dec 2, 2014 at 5:22 AM, Niranda Perera nira...@wso2.com

Re: zinc invocation examples

2014-12-04 Thread Sean Owen
You just run it once with zinc -start and leave it running as a background process on your build machine. You don't have to do anything for each build. On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

Re: Protobuf version in mvn vs sbt

2014-12-05 Thread Sean Owen
(Nit: CDH *5.1.x*, including 5.1.3, is derived from Hadoop 2.3.x. 5.3 is based on 2.5.x) On Fri, Dec 5, 2014 at 3:29 PM, DB Tsai dbt...@dbtsai.com wrote: As Marcelo said, CDH5.3 is based on hadoop 2.3, so please try - To

Is this a little bug in BlockTransferMessage ?

2014-12-09 Thread Sean Owen
https://github.com/apache/spark/blob/master/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/BlockTransferMessage.java#L70

public byte[] toByteArray() {
  ByteBuf buf = Unpooled.buffer(encodedLength());
  buf.writeByte(type().id);
  encode(buf);
  assert buf.writableBytes()
