Regarding RecordReader of Spark

2014-11-16 Thread Vibhanshu Prasad
Hello everyone, I am going through the source code of RDDs and record readers. I found 2 classes: 1. WholeTextFileRecordReader 2. WholeCombineFileRecordReader (extends CombineFileRecordReader). The descriptions of the two classes are virtually identical. I am not able to understand why we
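For context, these record readers back SparkContext.wholeTextFiles, which reads each file as one (filename, content) record rather than line by line. A minimal usage sketch (the path is a placeholder):

    // Each element is (fileName, fileContent); the CombineFile* machinery
    // exists to pack many small files into fewer input splits.
    val files = sc.wholeTextFiles("hdfs:///data/small-files")
    files.map { case (name, content) => (name, content.length) }.collect()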

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Dinesh J. Weerakkody
Hi Stephen and Sean, Thanks for the correction. On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote: No, the Maven build is the main one. I would use it unless you have a need to use the SBT build in particular. On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody

send currentJars and currentFiles to executor with actor?

2014-11-16 Thread scwf
I notice that Spark serializes each task with its dependencies (files and JARs added to the SparkContext): def serializeWithDependencies( task: Task[_], currentFiles: HashMap[String, Long], currentJars: HashMap[String, Long], serializer: SerializerInstance) :
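A rough sketch of what a method with this signature plausibly does, reconstructed from the quoted signature and the discussion below (write the file and jar name/timestamp maps, then append the serialized task bytes); Task and SerializerInstance are the Spark-internal types from the signature, and the details are an approximation rather than a verbatim copy of the Spark 1.x source:

    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import java.nio.ByteBuffer
    import scala.collection.mutable.HashMap

    def serializeWithDependencies(
        task: Task[_],
        currentFiles: HashMap[String, Long],
        currentJars: HashMap[String, Long],
        serializer: SerializerInstance): ByteBuffer = {
      val out = new ByteArrayOutputStream(4096)
      val dataOut = new DataOutputStream(out)

      // Write the file dependencies: only (path, timestamp), not the bytes
      dataOut.writeInt(currentFiles.size)
      for ((name, timestamp) <- currentFiles) {
        dataOut.writeUTF(name)
        dataOut.writeLong(timestamp)
      }

      // Same for the jar dependencies
      dataOut.writeInt(currentJars.size)
      for ((name, timestamp) <- currentJars) {
        dataOut.writeUTF(name)
        dataOut.writeLong(timestamp)
      }

      // Append the serialized task itself
      dataOut.flush()
      out.write(serializer.serialize(task).array())
      ByteBuffer.wrap(out.toByteArray)
    }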

If first batch fails, does Streaming JobGenerator.stop() hang?

2014-11-16 Thread Sean Owen
I thought I'd ask first since there's a good chance this isn't a problem, but I'm having a problem wherein the first batch that Spark Streaming processes fails (due to an app problem), but then stop() blocks for a very long time. This bit of JobGenerator.stop() executes, since the message appears
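For reference, JobGenerator.stop() is reached through StreamingContext.stop(); a graceful stop waits for pending batches to complete, which is presumably where the long block described above can occur. A minimal sketch, assuming ssc is the running StreamingContext:

    // stopGracefully = true asks the JobGenerator to wait for queued
    // batches to finish before shutting down, so a first batch that
    // failed and never completes could plausibly make this block.
    ssc.stop(stopSparkContext = true, stopGracefully = true)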

Re: send currentJars and currentFiles to executor with actor?

2014-11-16 Thread Reynold Xin
The current design is not ideal, but the size of the dependencies should be fairly small, since we only send the paths and timestamps, not the jars themselves. Executors can come and go. This is essentially a state replication problem, where you have to be very careful with consistency. On Sun, Nov 16,

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Michael Armbrust
I'm going to have to disagree here. If you are building a release distribution or integrating with legacy systems, then Maven is probably the correct choice. However, most of the core developers that I know use sbt, and I think it's a better choice for exploration and development overall. That

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Sean Owen
Yeah, my comment was mostly reflecting the fact that mvn is what creates the releases and is the 'build of reference', from which the SBT build is generated. The docs were recently changed to suggest that Maven is the default build and SBT is for advanced users. I find Maven plays nicer with IDEs,

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Stephen Boesch
Hi Michael, That insight is useful. Some thoughts: * I moved from sbt to maven in June specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that Jenkins uses mvn, so we need to run mvn before submitting PRs - even if sbt were used for day to
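For anyone checking the Maven side locally before opening a PR, the full build described in the Spark docs of this era looks roughly like the following (a sketch of one common invocation, not the exact Jenkins configuration):

    # compile and package everything, skipping tests for a faster local run
    mvn -DskipTests clean package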

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
The console mode of sbt (just run sbt/sbt, and a long-running console session is started that will accept further commands) is great for building individual subprojects or running single test suites. In addition to being faster, since it's a long-running JVM, it's got a lot of nice
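A sample of the kind of long-running session being described (the subproject and suite names are illustrative):

    $ sbt/sbt                       # start the console once, keep the JVM warm
    > project core                  # switch to the spark-core subproject
    > compile                       # incremental compile
    > test-only org.apache.spark.rdd.RDDSuite   # run a single test suite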

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Patrick Wendell
Neither is strictly optimal, which is why we ended up supporting both. Our reference build for packaging is Maven, so you are less likely to run into unexpected dependency issues, etc. Many developers use sbt as well. It's somewhat a matter of religion, and the best thing might be to try both and see which you

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
Ok, strictly speaking, that's equivalent to your second class of examples (development console), not the first (sbt console). On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra m...@clearstorydata.com wrote: The console mode of sbt (just run sbt/sbt and then a long running console session is started

Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread jay vyas
This is more a curiosity than an immediate problem. Here is my question: I ran into this easily solved issue http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou recently. The solution was to replace my class with a scala
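The failure mode being referenced, in miniature (all names made up for illustration): a closure that drags in a non-serializable enclosing instance fails only at runtime, and moving the method into an object avoids the capture. A hedged sketch:

    import org.apache.spark.rdd.RDD

    class Multiplier {                       // does NOT extend Serializable
      def double(x: Int): Int = x * 2
      def run(rdd: RDD[Int]): RDD[Int] =
        rdd.map(x => double(x))              // the closure captures `this`,
                                             // so running it throws
                                             // java.io.NotSerializableException
    }

    object Doubler {                         // the "replace class with object" fix
      def double(x: Int): Int = x * 2
    }
    // rdd.map(Doubler.double) works: the generated function reaches the
    // singleton statically instead of capturing an unserializable instance.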

Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Reynold Xin
That's a great idea, and it is also a pain point for some users. However, it is not possible to solve this problem at compile time, because what actually gets serialized can only be determined at runtime. There are some efforts in Scala to help users avoid mistakes like this. One example project

Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Andrew Ash
Hi Jay, I just came across SPARK-720 Statically guarantee serialization will succeed https://issues.apache.org/jira/browse/SPARK-720 which sounds like exactly what you're referring to. Like Reynold I think it's not possible at this time but it would be good to get your feedback on that ticket.

Re: Regarding RecordReader of Spark

2014-11-16 Thread Reynold Xin
I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add more documentation on the whole thing? Thanks. On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com wrote:

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-16 Thread Josh Rosen
-1 I found a potential regression in 1.1.1 related to spark-submit and cluster deploy mode: https://issues.apache.org/jira/browse/SPARK-4434 I think that this is worth fixing. On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com wrote: +1 Tested HiveThriftServer2 against Hive

Re: Regarding RecordReader of Spark

2014-11-16 Thread Andrew Ash
Filed as https://issues.apache.org/jira/browse/SPARK-4437 On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote: I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain and maybe submit a patch to add

Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-16 Thread Kousuke Saruta
Now I've finished the revert for SPARK-4434 and opened a PR. (2014/11/16 17:08), Josh Rosen wrote: -1 I found a potential regression in 1.1.1 related to spark-submit and cluster deploy mode: https://issues.apache.org/jira/browse/SPARK-4434 I think that this is worth fixing. On Fri, Nov 14, 2014

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Yiming (John) Zhang
Hi Dinesh, Sean, Michael, Stephen, Mark, and Patrick, Thank you for your replies and discussion. So the conclusion is that mvn is preferred for packaging and distribution, while sbt is better for development. This also explains why the compilation tool of make-distribution.sh changed from

Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
More or less correct, but I'd add that there are an awful lot of software systems out there that use Maven. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as legacy, Michael :) What that

Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-16 Thread slcclimber
Ashutosh, The counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would be to use something along these lines: val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size) val rddWithIndex =
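Filled out slightly (the continuation after rddWithIndex = is a guess at the obvious completion), the suggestion is to zip an explicit index RDD against the data instead of sharing a counter across nodes; note that rdd.zipWithIndex() packages the same idea:

    // Build an index RDD of the same length and partition count, then zip.
    // Caveat: zip also requires equal element counts per partition, which
    // parallelize only approximates, so the built-in is the safer route.
    val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
    val rddWithIndex = rdd.zip(index)       // RDD[(T, Long)]

    // Equivalent one-liner available in the Spark API since 1.0:
    val rddWithIndex2 = rdd.zipWithIndex()  // RDD[(T, Long)]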