Regarding RecordReader of spark

2014-11-16 Thread Vibhanshu Prasad
Hello Everyone,

I am going through the source code of RDDs and record readers,
and I found 2 classes:

1. WholeTextFileRecordReader
2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )

The descriptions of the two classes are practically identical.

I am not able to understand why we have 2 classes. Is
CombineFileRecordReader providing some extra advantage?
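
For context, here is a hedged sketch (based on Hadoop's CombineFileInputFormat API rather
than Spark's actual source; the class names below are mine) of how a per-file reader and a
"combine"-style reader typically relate:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{CombineFileInputFormat, CombineFileRecordReader, CombineFileSplit}

// PerFileReader stands in for a WholeTextFileRecordReader-like class: it is responsible for
// reading the single file at position `index` inside a CombineFileSplit and emitting it as
// one (path, content) record.
class PerFileReader(split: CombineFileSplit, context: TaskAttemptContext, index: Integer)
  extends RecordReader[Text, Text] {
  def initialize(genericSplit: InputSplit, ctx: TaskAttemptContext): Unit = ()
  def nextKeyValue(): Boolean = false   // a real reader would read split.getPath(index) once
  def getCurrentKey: Text = new Text()
  def getCurrentValue: Text = new Text()
  def getProgress: Float = 1.0f
  def close(): Unit = ()
}

// The "combine" reader is just an adapter: Hadoop's CombineFileRecordReader walks every file
// packed into the combined split and instantiates one PerFileReader per file.
class WholeFilesInputFormat extends CombineFileInputFormat[Text, Text] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
    : RecordReader[Text, Text] =
    new CombineFileRecordReader[Text, Text](
      split.asInstanceOf[CombineFileSplit], context, classOf[PerFileReader])
}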

Regards
Vibhanshu


Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Dinesh J. Weerakkody
Hi Stephen and Sean,

Thanks for correction.

On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote:

 No, the Maven build is the main one.  I would use it unless you have a
 need to use the SBT build in particular.
 On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody 
 dineshjweerakk...@gmail.com wrote:

 Hi Yiming,

 I believe that both SBT and MVN are supported in Spark, but SBT is
 preferred
 (I'm not 100% sure about this :) ). When I was using MVN I got some build
 failures. After that I used SBT and it works fine.

 You can go through these discussions regarding SBT vs MVN and learn pros
 and cons of both [1] [2].

 [1]

 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html

 [2]

 https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ

 Thanks,

 On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang sdi...@gmail.com
 wrote:

  Hi,
 
 
 
  I am new to developing Spark and my current focus is on co-scheduling of
  Spark tasks. However, I am confused by the build tools: sometimes the
  documentation uses mvn and sometimes sbt.
 
 
 
   So, my question is: which one is the preferred tool of the Spark community?
   And what's the technical difference between them? Thank you!
 
 
 
  Cheers,
 
  Yiming
 
 


 --
 Thanks & Best Regards,

 *Dinesh J. Weerakkody*




-- 
Thanks & Best Regards,

*Dinesh J. Weerakkody*


send currentJars and currentFiles to executor with actor?

2014-11-16 Thread scwf
I notice that Spark serializes each task together with its dependencies (the files
and JARs added to the SparkContext):
  def serializeWithDependencies(
      task: Task[_],
      currentFiles: HashMap[String, Long],
      currentJars: HashMap[String, Long],
      serializer: SerializerInstance)
    : ByteBuffer = {

    val out = new ByteArrayOutputStream(4096)
    val dataOut = new DataOutputStream(out)

    // Write currentFiles
    dataOut.writeInt(currentFiles.size)
    for ((name, timestamp) <- currentFiles) {
      dataOut.writeUTF(name)
      dataOut.writeLong(timestamp)
    }

    // Write currentJars
    dataOut.writeInt(currentJars.size)
    for ((name, timestamp) <- currentJars) {
      dataOut.writeUTF(name)
      dataOut.writeLong(timestamp)
    }

    // Write the task itself and finish
    dataOut.flush()
    val taskBytes = serializer.serialize(task).array()
    out.write(taskBytes)
    ByteBuffer.wrap(out.toByteArray)
  }

Why not send currentJars and currentFiles to the executor using an actor? I think
it's not necessary to serialize them for each task.
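
For context, here is a hedged sketch of what the read side looks like (it mirrors the write
loop above; the name and shape are mine and not necessarily the exact Spark source). It also
shows that the per-task overhead is just the two small name -> timestamp maps:

import java.io.{ByteArrayInputStream, DataInputStream}
import java.nio.ByteBuffer
import scala.collection.mutable.HashMap

// Hedged sketch of the executor-side counterpart: pull the two maps back out of the
// buffer, and leave whatever remains as the serialized task itself.
def deserializeWithDependencies(serializedTask: ByteBuffer)
    : (HashMap[String, Long], HashMap[String, Long], ByteBuffer) = {
  val bytes = new Array[Byte](serializedTask.remaining())
  serializedTask.get(bytes)
  val dataIn = new DataInputStream(new ByteArrayInputStream(bytes))

  def readMap(): HashMap[String, Long] = {
    val m = new HashMap[String, Long]()
    val n = dataIn.readInt()
    for (_ <- 0 until n) {
      val name = dataIn.readUTF()
      m(name) = dataIn.readLong()
    }
    m
  }

  val files = readMap()                          // written first: currentFiles
  val jars  = readMap()                          // written second: currentJars
  val taskBytes = new Array[Byte](dataIn.available())
  dataIn.readFully(taskBytes)                    // the rest is the task
  (files, jars, ByteBuffer.wrap(taskBytes))
}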



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/send-currentJars-and-currentFiles-to-exetutor-with-actor-tp9381.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



If first batch fails, does Streaming JobGenerator.stop() hang?

2014-11-16 Thread Sean Owen
I thought I'd ask first since there's a good chance this isn't a
problem, but I'm seeing an issue wherein the first batch that Spark
Streaming processes fails (due to an app problem), and then stop()
blocks for a very long time.

This bit of JobGenerator.stop() executes, since the message appears in the logs:


def haveAllBatchesBeenProcessed = {
  lastProcessedBatch != null && lastProcessedBatch.milliseconds == stopTime
}
logInfo("Waiting for jobs to be processed and checkpoints to be written")
while (!hasTimedOut && !haveAllBatchesBeenProcessed) {
  Thread.sleep(pollTime)
}

// ... 10x batch duration wait here, before seeing the next line logged:

logInfo("Waited for jobs to be processed and checkpoints to be written")


I think that lastProcessedBatch is always null since no batch ever
succeeds. Of course, for all this code knows, the next batch might
succeed, so it waits there for it. But shouldn't it proceed after
one more batch completes, even if that batch failed?

JobGenerator.onBatchCompleted is only called for a successful batch.
Can it be called if it fails too? I think that would fix it.

Should the condition also be lastProcessedBatch.milliseconds >= stopTime
rather than == ?
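
To make the suggestion concrete, here is a hedged, self-contained sketch (the names and
shapes here are mine, not the actual JobScheduler/JobGenerator code):

import scala.util.{Failure, Success, Try}

// Hedged sketch only; it illustrates the two changes suggested above.
object JobGeneratorSketch {
  case class Job(timeMs: Long, result: Try[Unit])

  @volatile private var lastProcessedBatchMs: Long = -1L

  // Change 1: record completion whether the job succeeded or failed, so the
  // "last processed batch" marker advances even when the app code throws.
  def handleJobCompletion(job: Job): Unit = job.result match {
    case Success(_) =>
      lastProcessedBatchMs = job.timeMs
    case Failure(e) =>
      System.err.println(s"Batch ${job.timeMs} failed: $e")
      lastProcessedBatchMs = job.timeMs          // proposed: still count it as processed
  }

  // Change 2: use >= rather than == so a batch at or beyond stopTime also unblocks stop().
  def haveAllBatchesBeenProcessed(stopTimeMs: Long): Boolean =
    lastProcessedBatchMs >= 0 && lastProcessedBatchMs >= stopTimeMs
}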

Thanks for any pointers.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: send currentJars and currentFiles to executor with actor?

2014-11-16 Thread Reynold Xin
The current design is not ideal, but the size of dependencies should be
fairly small since we only send the path and timestamp, not the jars
themselves.

Executors can come and go. This is essentially a state replication problem
where you have to be very careful about consistency.

On Sun, Nov 16, 2014 at 4:24 AM, scwf wangf...@huawei.com wrote:

 Why not send currentJars and currentFiles to the executor using an actor? I think
 it's not necessary to serialize them for each task.







Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Michael Armbrust
I'm going to have to disagree here.  If you are building a release
distribution or integrating with legacy systems then maven is probably the
correct choice.  However most of the core developers that I know use sbt,
and I think it's a better choice for exploration and development overall.
That said, this probably falls into the category of a religious argument so
you might want to look at both options and decide for yourself.

In my experience the SBT build is significantly faster with less effort
(and I think sbt is still faster even if you go through the extra effort of
installing zinc) and easier to read.  The console mode of sbt (just run
sbt/sbt and then a long running console session is started that will accept
further commands) is great for building individual subprojects or running
single test suites.  In addition to being faster since it's a long running
JVM, it's got a lot of nice features like tab-completion for test case names.

For example, if I wanted to see what test cases are available in the SQL
subproject you can do the following:

[marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
[info] Loading project definition from
/Users/marmbrus/workspace/spark/project/project
[info] Loading project definition from
/Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[info] Set current project to spark-parent (in build
file:/Users/marmbrus/workspace/spark/)
> sql/test-only *tab*
--
 org.apache.spark.sql.CachedTableSuite
org.apache.spark.sql.DataTypeSuite
 org.apache.spark.sql.DslQuerySuite
org.apache.spark.sql.InsertIntoSuite
...

Another very useful feature is the development console, which starts an
interactive REPL including the most recent version of the code and a lot of
useful imports for some subprojects.  For example in the hive subproject it
automatically sets up a temporary database with a bunch of test data
pre-loaded:

$ sbt/sbt hive/console
> hive/console
...
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.parquet.ParquetTestData
Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_45).
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("SELECT * FROM src").take(2)
res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])

Michael

On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody 
dineshjweerakk...@gmail.com wrote:

 Hi Stephen and Sean,

 Thanks for correction.




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Sean Owen
Yeah, my comment was mostly reflecting the fact that mvn is what
creates the releases and is the 'build of reference', from which the
SBT build is generated. The docs were recently changed to suggest that
Maven is the default build and SBT is for advanced users. I find Maven
plays nicer with IDEs, or at least, IntelliJ.

SBT is faster for incremental compilation and better for anyone who
knows and can leverage SBT's model.

If someone's new to it all, I dunno, they're likelier to have fewer
problems using Maven to start? YMMV.

On Sun, Nov 16, 2014 at 9:23 PM, Michael Armbrust
mich...@databricks.com wrote:
 I'm going to have to disagree here.  If you are building a release
 distribution or integrating with legacy systems then maven is probably the
 correct choice.  However most of the core developers that I know use sbt,
 and I think it's a better choice for exploration and development overall.
 That said, this probably falls into the category of a religious argument so
 you might want to look at both options and decide for yourself.




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Stephen Boesch
Hi Michael,
 That insight is useful.  Some thoughts:

* I moved from sbt to maven in June specifically due to Andrew Or's
describing mvn as the default build tool.  Developers should keep in mind
that Jenkins uses mvn, so we need to run mvn before submitting PRs - even
if sbt were used for day-to-day dev work.
* In addition, as Sean has alluded to, IntelliJ seems to comprehend
the maven builds a bit more readily than sbt.
* But for command-line and day-to-day dev purposes, sbt sounds great to
use.  Those sound bites you provided about exposing built-in test databases
for hive and for displaying available test cases are sweet.  Any
easy/convenient way to see more of those kinds of facilities available
through sbt?


2014-11-16 13:23 GMT-08:00 Michael Armbrust mich...@databricks.com:

 I'm going to have to disagree here.  If you are building a release
 distribution or integrating with legacy systems then maven is probably the
 correct choice.  However most of the core developers that I know use sbt,
  and I think it's a better choice for exploration and development overall.
 That said, this probably falls into the category of a religious argument so
 you might want to look at both options and decide for yourself.




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra

 The console mode of sbt (just run
 sbt/sbt and then a long running console session is started that will accept
 further commands) is great for building individual subprojects or running
 single test suites.  In addition to being faster since it's a long running
 JVM, it's got a lot of nice features like tab-completion for test case
 names.


We include the scala-maven-plugin in spark/pom.xml, so equivalent
functionality is available using Maven.  You can start a console session
with `mvn scala:console`.
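
For example, a hedged sketch of that workflow (assuming the plugin wiring described above;
compiling first keeps the REPL's classpath current):

$ mvn -DskipTests compile
$ mvn scala:console
...
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext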


On Sun, Nov 16, 2014 at 1:23 PM, Michael Armbrust mich...@databricks.com
wrote:

 I'm going to have to disagree here.  If you are building a release
 distribution or integrating with legacy systems then maven is probably the
 correct choice.  However most of the core developers that I know use sbt,
  and I think it's a better choice for exploration and development overall.
 That said, this probably falls into the category of a religious argument so
 you might want to look at both options and decide for yourself.




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Patrick Wendell
Neither is strictly optimal, which is why we ended up supporting both.
Our reference build for packaging is Maven, so you are less likely to
run into unexpected dependency issues, etc. Many developers use sbt as
well. It's somewhat a matter of religion, and the best thing might be to
try both and see which you prefer.

- Patrick

On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra m...@clearstorydata.com wrote:

 The console mode of sbt (just run
 sbt/sbt and then a long running console session is started that will accept
 further commands) is great for building individual subprojects or running
  single test suites.  In addition to being faster since it's a long running
  JVM, it's got a lot of nice features like tab-completion for test case
 names.


 We include the scala-maven-plugin in spark/pom.xml, so equivalent
 functionality is available using Maven.  You can start a console session
 with `mvn scala:console`.



Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
Ok, strictly speaking, that's equivalent to your second class of examples,
the "development console", not the first, the sbt console.

On Sun, Nov 16, 2014 at 1:47 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 The console mode of sbt (just run
 sbt/sbt and then a long running console session is started that will
 accept
 further commands) is great for building individual subprojects or running
 single test suites.  In addition to being faster since it's a long running
 JVM, it's got a lot of nice features like tab-completion for test case
 names.


 We include the scala-maven-plugin in spark/pom.xml, so equivalent
 functionality is available using Maven.  You can start a console session
 with `mvn scala:console`.






Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread jay vyas
This is more a curiosity than an immediate problem.

Here is my question: I ran into this easily solved issue
http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
recently.  The solution was to replace my class with a Scala singleton,
which I guess is readily serializable.

So it's clear that Spark needs to serialize the objects which carry the driver
methods for an app in order to run... but I'm wondering: maybe there is
a way to change or update the Spark API to catch unserializable Spark apps
at compile time?
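
To make the pattern concrete, here is a hedged, self-contained illustration (the class and
object names are mine, not from the linked question):

import org.apache.spark.{SparkConf, SparkContext}

// A method on a non-serializable class drags the whole instance into the task closure,
// which fails at runtime with a NotSerializableException.
class Helper {                                   // not Serializable
  def double(x: Int): Int = x * 2
}

// Moving the function into a Scala singleton avoids capturing any outer instance,
// so the closure serializes cleanly.
object HelperSingleton {
  def double(x: Int): Int = x * 2
}

object Demo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local"))
    val h = new Helper
    // sc.parallelize(1 to 10).map(h.double).collect()            // Task not serializable
    sc.parallelize(1 to 10).map(HelperSingleton.double).collect() // works
    sc.stop()
  }
}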


-- 
jay vyas


Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Reynold Xin
That's a great idea and it is also a pain point for some users. However, it
is not possible to solve this problem at compile time, because the content
of serialization can only be determined at runtime.

There are some efforts in Scala to help users avoid mistakes like this. One
example project that is more researchy is Spore:
http://docs.scala-lang.org/sips/pending/spores.html



On Sun, Nov 16, 2014 at 4:12 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 This is more a curiosity than an immediate problem.




Re: Is there a way for scala compiler to catch unserializable app code?

2014-11-16 Thread Andrew Ash
Hi Jay,

I just came across SPARK-720 "Statically guarantee serialization will succeed"
(https://issues.apache.org/jira/browse/SPARK-720), which sounds like exactly
what you're referring to.  Like Reynold I think it's not possible at this
time, but it would be good to get your feedback on that ticket.

Andrew


On Sun, Nov 16, 2014 at 4:37 PM, Reynold Xin r...@databricks.com wrote:

 That's a great idea and it is also a pain point for some users. However, it
 is not possible to solve this problem at compile time, because the content
 of serialization can only be determined at runtime.




Re: Regarding RecordReader of spark

2014-11-16 Thread Reynold Xin
I don't think the code is immediately obvious.

Davies - I think you added the code, and Josh reviewed it. Can you guys
explain and maybe submit a patch to add more documentation on the whole
thing?

Thanks.


On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com
wrote:

 I am not able to understand why we have 2 classes. Is
 CombineFileRecordReader providing some extra advantage?

 Regards
 Vibhanshu



Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-16 Thread Josh Rosen
-1

I found a potential regression in 1.1.1 related to spark-submit and cluster
deploy mode: https://issues.apache.org/jira/browse/SPARK-4434

I think that this is worth fixing.

On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian lian.cs@gmail.com wrote:

 +1

 Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues are
 fixed. Hive version inspection works as expected.


 On 11/15/14 8:25 AM, Zach Fry wrote:

 +0

 I expect to start testing on Monday but won't have enough results to
 change
 my vote from +0
 until Monday night or Tuesday morning.

 Thanks,
 Zach



 --
  View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC1-tp9311p9370.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Regarding RecordReader of spark

2014-11-16 Thread Andrew Ash
Filed as https://issues.apache.org/jira/browse/SPARK-4437

On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote:

 I don't think the code is immediately obvious.

 Davies - I think you added the code, and Josh reviewed it. Can you guys
 explain and maybe submit a patch to add more documentation on the whole
 thing?

 Thanks.


 On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad 
 vibhanshugs...@gmail.com
 wrote:

  Hello Everyone,
 
  I am going through the source code of rdd and Record readers
  There are found 2 classes
 
  1. WholeTextFileRecordReader
  2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )
 
  The description of both the classes is perfectly similar.
 
  I am not able to understand why we have 2 classes. Is
  CombineFileRecordReader providing some extra advantage?
 
  Regards
  Vibhanshu
 



Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-16 Thread Kousuke Saruta

Now I've finished the revert for SPARK-4434 and opened a PR.

(2014/11/16 17:08), Josh Rosen wrote:

-1

I found a potential regression in 1.1.1 related to spark-submit and cluster
deploy mode: https://issues.apache.org/jira/browse/SPARK-4434

I think that this is worth fixing.




re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Yiming (John) Zhang
Hi Dinesh, Sean, Michael, Stephen, Mark, and Patrick

 

Thank you for your replies and discussion. So the conclusion is that mvn is
preferred for packaging and distribution, while sbt is better for development.
This also explains why the compilation tool of make-distribution.sh changed
from sbt (in spark-0.9) to mvn (in spark-1.0).

 

Cheers,

Yiming

 




Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Mark Hamstra
More or less correct, but I'd add that there are an awful lot of software
systems out there that use Maven.  Integrating with those systems is
generally easier if you are also working with Spark in Maven.  (And I
wouldn't classify all of those Maven-built systems as legacy, Michael :)
 What that ends up meaning is that if you are working *on* Spark, then SBT
can be more convenient and productive; but if you are working *with* Spark
along with other significant pieces of software, then using Maven can be
the better approach.

On Sun, Nov 16, 2014 at 6:11 PM, Yiming (John) Zhang sdi...@gmail.com
wrote:

 Hi Dinesh, Sean, Michael, Stephen, Mark, and Patrick



 Thank you for your replies and discussion. So the conclusion is that mvn is
 preferred for packaging and distribution, while sbt is better for
 development. This also explains why the compilation tool of
 make-distribution.sh changed from sbt (in spark-0.9) to mvn (in spark-1.0).



 Cheers,

 Yiming






Re: [MLlib] Contributing Algorithm for Outlier Detection

2014-11-16 Thread slcclimber
Ashutosh,
The counter will certainly be a parallelization issue when multiple nodes are
used, especially over massive datasets.
A better approach would be to use something along these lines:

val index = sc.parallelize(Range.Long(0, rdd.count, 1),
  rdd.partitions.size)
val rddWithIndex = rdd.zip(index)

which zips the two RDDs in a parallelizable fashion.
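
(As a hedged aside, assuming Spark 1.0 or later: RDD.zipWithIndex does the same thing in a
single call.)

// Pairs each element with its index; it is partition-aware, so no shared counter is needed.
val rddWithIndex = rdd.zipWithIndex()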




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9399.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org