Regarding RecordReader of Spark
Hello everyone, I am going through the source code of the RDDs and record readers, and I found 2 classes: 1. WholeTextFileRecordReader 2. WholeCombineFileRecordReader (extends CombineFileRecordReader). The descriptions of the two classes are practically identical, so I am not able to understand why we have both. Does CombineFileRecordReader provide some extra advantage?

Regards,
Vibhanshu
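For context, both of those readers sit behind SparkContext.wholeTextFiles, which returns one (path, content) record per file and uses the CombineFileInputFormat machinery to pack many small files into each split. A minimal usage sketch (the directory path is hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

  object WholeTextDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("whole-text-demo").setMaster("local[2]"))
      // Each record is (filePath, fileContent).
      val files = sc.wholeTextFiles("/tmp/small-files")   // hypothetical path
      files.mapValues(_.length).collect().foreach(println)
      sc.stop()
    }
  }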
Re: mvn or sbt for studying and developing Spark?
Hi Stephen and Sean, thanks for the correction.

On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote:

  No, the Maven build is the main one. I would use it unless you have a need to use the SBT build in particular.

On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody dineshjweerakk...@gmail.com wrote:

  Hi Yiming, I believe that both SBT and MVN are supported in Spark, but SBT is preferred (I'm not 100% sure about this :) ). When I used MVN I got some build failures; after that I used SBT and it worked fine. You can go through these discussions regarding SBT vs MVN and learn the pros and cons of both [1] [2].

  [1] http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html
  [2] https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ

On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang sdi...@gmail.com wrote:

  Hi, I am new to developing Spark and my current focus is the co-scheduling of Spark tasks. However, I am confused by the build tools: sometimes the documentation uses mvn and sometimes sbt. So, my question is: which one is the preferred tool of the Spark community? And what is the technical difference between them? Thank you! Cheers, Yiming

--
Thanks Best Regards,
*Dinesh J. Weerakkody*
send currentJars and currentFiles to executor with actor?
I notice that Spark serializes each task together with its dependencies (the files and JARs added to the SparkContext):

  def serializeWithDependencies(
      task: Task[_],
      currentFiles: HashMap[String, Long],
      currentJars: HashMap[String, Long],
      serializer: SerializerInstance)
    : ByteBuffer = {
    val out = new ByteArrayOutputStream(4096)
    val dataOut = new DataOutputStream(out)

    // Write currentFiles
    dataOut.writeInt(currentFiles.size)
    for ((name, timestamp) <- currentFiles) {
      dataOut.writeUTF(name)
      dataOut.writeLong(timestamp)
    }

    // Write currentJars
    dataOut.writeInt(currentJars.size)
    for ((name, timestamp) <- currentJars) {
      dataOut.writeUTF(name)
      dataOut.writeLong(timestamp)
    }

    // Write the task itself and finish
    dataOut.flush()
    val taskBytes = serializer.serialize(task).array()
    out.write(taskBytes)
    ByteBuffer.wrap(out.toByteArray)
  }

Why not send currentJars and currentFiles to the executor using an actor? I think it's not necessary to serialize them for each task.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/send-currentJars-and-currentFiles-to-exetutor-with-actor-tp9381.html
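For reference, these maps are populated on the driver through the public SparkContext API; a minimal sketch of where the entries come from (paths hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}

  object DepsDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("deps-demo").setMaster("local[2]"))
      // Each call records a (path -> timestamp) entry on the driver. It is
      // these small (name, timestamp) pairs, not the file bytes, that
      // serializeWithDependencies writes in front of every task.
      sc.addJar("/tmp/my-udfs.jar")    // hypothetical jar
      sc.addFile("/tmp/lookup.csv")    // hypothetical file
      sc.stop()
    }
  }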
If first batch fails, does Streaming JobGenerator.stop() hang?
I thought I'd ask first since there's a good chance this isn't a problem, but I'm hitting an issue wherein the first batch that Spark Streaming processes fails (due to an app problem), and then stop() blocks for a very long time. This bit of JobGenerator.stop() executes, since the message appears in the logs:

  def haveAllBatchesBeenProcessed = {
    lastProcessedBatch != null && lastProcessedBatch.milliseconds == stopTime
  }
  logInfo("Waiting for jobs to be processed and checkpoints to be written")
  while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
    Thread.sleep(pollTime)
  }
  // ... 10x batch duration wait here, before seeing the next line log:
  logInfo("Waited for jobs to be processed and checkpoints to be written")

I think lastProcessedBatch is always null, since no batch ever succeeds. Of course, for all this code knows, the next batch might succeed, so it sits there waiting for it. But shouldn't it proceed after one more batch completes, even if that batch failed? JobGenerator.onBatchCompleted is only called for a successful batch. Can it be called on failure too? I think that would fix it. Should the condition also not be lastProcessedBatch.milliseconds >= stopTime instead of == ? Thanks for any pointers.
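To make those two suggestions concrete, here is a small self-contained sketch of the changed guard (a hypothetical illustration that mirrors the names above, not Spark's actual code or a patch):

  object StopGuardSketch {
    case class Time(milliseconds: Long)

    @volatile var lastProcessedBatch: Time = null

    // Suggested guard: >= rather than ==, so a batch that completes at or
    // after the stop time also counts.
    def haveAllBatchesBeenProcessed(stopTime: Long): Boolean =
      lastProcessedBatch != null && lastProcessedBatch.milliseconds >= stopTime

    def main(args: Array[String]): Unit = {
      val stopTime = 2000L
      // If onBatchCompleted were also invoked for failed batches,
      // lastProcessedBatch would advance even when every batch fails:
      lastProcessedBatch = Time(3000L)  // batch finished after stopTime
      println(haveAllBatchesBeenProcessed(stopTime))  // true with >=, false with ==
    }
  }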
Re: send currentJars and currentFiles to executor with actor?
The current design is not ideal, but the size of the dependencies should be fairly small, since we only send the paths and timestamps, not the jars themselves. Executors can come and go, so this is essentially a state replication problem, and with those you have to be very careful about consistency.
Re: mvn or sbt for studying and developing Spark?
I'm going to have to disagree here. If you are building a release distribution or integrating with legacy systems then Maven is probably the correct choice. However, most of the core developers that I know use sbt, and I think it's a better choice for exploration and development overall. That said, this probably falls into the category of a religious argument, so you might want to look at both options and decide for yourself.

In my experience the SBT build is significantly faster with less effort (and I think sbt is still faster even if you go through the extra effort of installing zinc) and easier to read. The console mode of sbt (just run sbt/sbt and a long-running console session is started that will accept further commands) is great for building individual subprojects or running single test suites. In addition to being faster since it's a long-running JVM, it's got a lot of nice features like tab-completion for test case names. For example, if I wanted to see what test cases are available in the SQL subproject, I can do the following:

  [marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
  [info] Loading project definition from /Users/marmbrus/workspace/spark/project/project
  [info] Loading project definition from /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
  [info] Set current project to spark-parent (in build file:/Users/marmbrus/workspace/spark/)
  > sql/test-only <tab>
  org.apache.spark.sql.CachedTableSuite   org.apache.spark.sql.DataTypeSuite
  org.apache.spark.sql.DslQuerySuite      org.apache.spark.sql.InsertIntoSuite
  ...

Another very useful feature is the development console, which starts an interactive REPL including the most recent version of the code and a lot of useful imports for some subprojects. For example, in the hive subproject it automatically sets up a temporary database with a bunch of test data pre-loaded:

  $ sbt/sbt hive/console
  ...
  import org.apache.spark.sql.hive._
  import org.apache.spark.sql.hive.test.TestHive._
  import org.apache.spark.sql.parquet.ParquetTestData
  Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45).
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala> sql("SELECT * FROM src").take(2)
  res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])

Michael
Re: mvn or sbt for studying and developing Spark?
Yeah, my comment was mostly reflecting the fact that mvn is what creates the releases and is the 'build of reference', from which the SBT build is generated. The docs were recently changed to suggest that Maven is the default build and SBT is for advanced users. I find Maven plays nicer with IDEs, or at least with IntelliJ. SBT is faster for incremental compilation and better for anyone who knows and can leverage SBT's model. If someone's new to it all, I dunno, they're likelier to have fewer problems using Maven to start? YMMV.
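For anyone who does try the zinc route: the setup is a one-time step, since the incremental-compile server keeps running and subsequent Maven builds reuse it. A sketch, assuming the standalone Zinc distribution is on your PATH (check the building-spark docs for the exact invocation):

  $ zinc -start
  $ mvn -DskipTests clean package   # Scala compilation now goes through the running Zinc server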
Re: mvn or sbt for studying and developing Spark?
Hi Michael, that insight is useful. Some thoughts:

* I moved from sbt to maven in June, specifically due to Andrew Or's describing mvn as the default build tool. Developers should keep in mind that Jenkins uses mvn, so we need to run mvn before submitting PRs - even if sbt were used for day-to-day dev work.
* In addition, as Sean has alluded to, IntelliJ seems to comprehend the Maven builds a bit more readily than the sbt ones.
* But for command-line and day-to-day dev purposes, sbt sounds great to use.

Those sound bites you provided about exposing built-in test databases for hive and for displaying available test cases are sweet. Any easy/convenient way to see more of those kinds of facilities available through sbt?
Re: mvn or sbt for studying and developing Spark?
> The console mode of sbt (just run sbt/sbt and a long-running console session
> is started that will accept further commands) is great for building
> individual subprojects or running single test suites. In addition to being
> faster since it's a long-running JVM, it's got a lot of nice features like
> tab-completion for test case names.

We include the scala-maven-plugin in spark/pom.xml, so equivalent functionality is available using Maven. You can start a console session with `mvn scala:console`.
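Single test suites can likewise be run straight from Maven through the scalatest-maven-plugin that the build already uses; a sketch (the module path and flag spelling may differ in your checkout, so treat this as illustrative):

  $ mvn -am -pl sql/core test -DwildcardSuites=org.apache.spark.sql.CachedTableSuite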
Re: mvn or sbt for studying and developing Spark?
Neither is strictly optimal, which is why we ended up supporting both. Our reference build for packaging is Maven, so you are less likely to run into unexpected dependency issues there. Many developers use sbt as well. It's somewhat a matter of religion, and the best thing might be to try both and see which you prefer.

- Patrick
Re: mvn or sbt for studying and developing Spark?
Ok, strictly speaking that's equivalent to the second class of examples (the development console), not the first (the sbt console mode itself).
Is there a way for the Scala compiler to catch unserializable app code?
This is more a curiosity than an immediate problem. Here is my question: I ran into this easily solved issue http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou recently. The solution was to replace my class with a Scala singleton, which I guess is readily serializable. So it's clear that Spark needs to serialize the objects which carry an app's driver methods in order to run... but I'm wondering: might there be a way to change or update the Spark API to catch unserializable Spark apps at compile time?

-- jay vyas
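To make the failure mode concrete, here is a minimal sketch of the pattern from that StackOverflow question (all names hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}

  class Helper { def inc(x: Int): Int = x + 1 }   // not Serializable

  object SerializableHelper extends Serializable {
    def inc(x: Int): Int = x + 1
  }

  object NotSerializableDemo {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("demo").setMaster("local[2]"))
      val h = new Helper
      // Compiles fine, but fails at runtime with
      // "org.apache.spark.SparkException: Task not serializable"
      // because the closure captures `h`:
      //   sc.parallelize(1 to 10).map(h.inc).collect()

      // The fix from the thread: call through a serializable singleton,
      // so the closure captures nothing unserializable.
      val out = sc.parallelize(1 to 10).map(SerializableHelper.inc).collect()
      println(out.mkString(","))
      sc.stop()
    }
  }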
Re: Is there a way for the Scala compiler to catch unserializable app code?
That's a great idea, and it is also a pain point for some users. However, it is not possible to solve this problem at compile time, because the content of the serialized closure can only be determined at runtime. There are some efforts in Scala to help users avoid mistakes like this. One example project, on the more researchy side, is Spores: http://docs.scala-lang.org/sips/pending/spores.html
Re: Is there a way for the Scala compiler to catch unserializable app code?
Hi Jay, I just came across SPARK-720, "Statically guarantee serialization will succeed" (https://issues.apache.org/jira/browse/SPARK-720), which sounds like exactly what you're referring to. Like Reynold, I think it's not possible at this time, but it would be good to get your feedback on that ticket. Andrew
Re: Regarding RecordReader of Spark
I don't think the code is immediately obvious. Davies - I think you added the code, and Josh reviewed it. Can you guys explain, and maybe submit a patch to add more documentation on the whole thing? Thanks.
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
-1

I found a potential regression in 1.1.1 related to spark-submit and cluster deploy mode: https://issues.apache.org/jira/browse/SPARK-4434

I think that this is worth fixing.

On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com wrote:

  +1 Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues are fixed. Hive version inspection works as expected.

On 11/15/14 8:25 AM, Zach Fry wrote:

  +0 I expect to start testing on Monday, but won't have enough results to change my vote from +0 until Monday night or Tuesday morning. Thanks, Zach
Re: Regarding RecordReader of Spark
Filed as https://issues.apache.org/jira/browse/SPARK-4437
Re: [VOTE] Release Apache Spark 1.1.1 (RC1)
I've now finished the revert for SPARK-4434 and opened a PR.

(2014/11/16 17:08), Josh Rosen wrote:

  -1 I found a potential regression in 1.1.1 related to spark-submit and cluster deploy mode: https://issues.apache.org/jira/browse/SPARK-4434 I think that this is worth fixing.
Re: mvn or sbt for studying and developing Spark?
Hi Dinesh, Sean, Michael, Stephen, Mark, and Patrick, thank you for your replies and the discussion. So the conclusion is that mvn is preferred for packaging and distribution, while sbt is better for development. This also explains why the build tool used by make-distribution.sh changed from sbt (in spark-0.9) to mvn (in spark-1.0).

Cheers, Yiming
Re: mvn or sbt for studying and developing Spark?
More or less correct, but I'd add that there are an awful lot of software systems out there that use Maven. Integrating with those systems is generally easier if you are also working with Spark in Maven. (And I wouldn't classify all of those Maven-built systems as legacy, Michael :) )

What that ends up meaning is that if you are working *on* Spark, then SBT can be more convenient and productive; but if you are working *with* Spark along with other significant pieces of software, then using Maven can be the better approach.
Re: [MLlib] Contributing Algorithm for Outlier Detection
Ashutosh, the counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would be to use something along these lines:

  val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
  val rddWithIndex = rdd.zip(index)

which zips the two RDDs in a parallelizable fashion.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9399.html
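A side note, in case it helps: unless I'm misremembering the version it landed in, RDD.zipWithIndex (available since Spark 1.0) produces the same (element, index) pairing directly:

  // Runs one small job to compute per-partition offsets, then assigns
  // indices without materializing a separate range RDD.
  val rddWithIndex = rdd.zipWithIndex()   // RDD[(T, Long)]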