Re: Spark Avro Generation

2014-08-12 Thread Devl Devel
Thanks very much, that helps; it saves having to build the entire project.


On Mon, Aug 11, 2014 at 6:09 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:

 If you don't want to build the entire thing, you can also do

 mvn generate-sources in external/flume-sink

 Thanks,
 Ron

 Sent from my iPhone

  On Aug 11, 2014, at 8:32 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:
 
  Jay, running sbt compile or assembly should generate the sources.
 
  On Monday, August 11, 2014, Devl Devel devl.developm...@gmail.com
 wrote:
 
  Hi
 
  So far I've been managing to build Spark from source, but since a change
  in spark-streaming-flume I have no idea how to generate the classes (e.g.
  SparkFlumeProtocol) from the Avro schema.
 
  I have used sbt to run avro:generate (from the top level spark dir) but
 it
  produces nothing - it just says:
 
  avro:generate
  [success] Total time: 0 s, completed Aug 11, 2014 12:26:49 PM.
 
  Please can someone send me their build.sbt or just tell me how to build
  Spark so that all the Avro sources get generated as well?
 
  Sorry for the noob question, but I really have tried my best on this one!
  Cheers
 



Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-12 Thread Graham Dennis
I've submitted a work-in-progress pull request for this issue that I'd like
feedback on; see https://github.com/apache/spark/pull/1890 . I've also
submitted a pull request for the related issue that exceptions hit when
trying to use a custom Kryo registrator are being swallowed:
https://github.com/apache/spark/pull/1827

The approach in my pull request is to get the Worker processes to download
the application jars and add them to the Executor class path at launch
time. There are a couple of things that still need to be done before this
can be merged:
1. At the moment, the first time a task runs in the executor, the
application jars are downloaded again.  My solution here would be to make
the executor skip downloading any jars that already exist.  Previously, the
driver and executor kept track of the timestamp of each jar file and would
re-download 'updated' jars; however, this never made sense, as the previous
version of the updated jar may have already been loaded into the executor,
so the updated jar may have no effect.  As my current pull request removes
the timestamps for jars, simply checking whether a jar exists will allow us
to avoid downloading the jars again.
2. Tests. :-)

A side-benefit of my pull request is that you will be able to use custom
serialisers that are distributed in a user jar.  Currently, the serialiser
instance is created in the Executor process before the first task is
received and therefore before any user jars are downloaded.  As this PR
adds user jars to the Executor process at launch time, this won't be an
issue.
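
For reference, here is a minimal sketch (not taken from either PR) of the
kind of custom-registrator setup under discussion; MyCustomClass and
MyRegistrator are hypothetical names standing in for classes that would be
shipped in the application jar:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Hypothetical user class shipped in the application jar.
    case class MyCustomClass(id: Int, name: String)

    // Custom registrator, also shipped in the application jar, so it is only
    // visible to threads whose class loader includes that jar.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyCustomClass])
      }
    }

    // In the driver program:
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyRegistrator].getName)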


On 7 August 2014 12:01, Graham Dennis graham.den...@gmail.com wrote:

 See my comment on https://issues.apache.org/jira/browse/SPARK-2878 for
 the full stacktrace, but it's in the BlockManager/BlockManagerWorker where
 it's trying to fulfil a getBlock request for another node.  The objects
 that would be in the block haven't yet been serialised, and that then
 causes the deserialisation to happen on that thread.  See
 MemoryStore.scala:102.


 On 7 August 2014 11:53, Reynold Xin r...@databricks.com wrote:

 I don't think it was a conscious design decision to not include the
 application classes in the connection manager serializer. We should fix
 that. Where is it deserializing data in that thread?

 4 might make sense in the long run, but it adds a lot of complexity to
 the code base (whole separate code base, task queue, blocking/non-blocking
 logic within task threads) that can be error prone, so I think it is best
 to stay away from that right now.





 On Wed, Aug 6, 2014 at 6:47 PM, Graham Dennis graham.den...@gmail.com
 wrote:

 Hi Spark devs,

 I’ve posted an issue on JIRA (
 https://issues.apache.org/jira/browse/SPARK-2878) which occurs when
 using
 Kryo serialisation with a custom Kryo registrator to register custom
 classes with Kryo.  This is an insidious issue that non-deterministically
 causes Kryo to build different ID-number-to-class-name maps on different
 nodes, which then causes weird exceptions (ClassCastException,
 ClassNotFoundException, ArrayIndexOutOfBoundsException) at
 deserialisation
 time.  I’ve created a reliable reproduction for the issue here:
 https://github.com/GrahamDennis/spark-kryo-serialisation

 I’m happy to try to put a pull request together to address this, but the
 right way to solve it isn’t obvious to me, and I’d like to get feedback and
 ideas on how to approach it.

 The root cause of the problem is a “Failed to run spark.kryo.registrator”
 error which non-deterministically occurs in some executor processes
 during
 operation.  My custom Kryo registrator is in the application jar, and it
 is
 accessible on the worker nodes.  This is demonstrated by the fact that
 most
 of the time the custom kryo registrator is successfully run.

 What’s happening is that Kryo serialisation/deserialisation usually occurs
 on an “Executor task launch worker” thread, which has its class loader set
 to contain the application jar.  This happens in
 `org.apache.spark.executor.Executor.TaskRunner.run`, and from what I can
 tell, it is only these threads that have access to the application jar
 (that contains the custom Kryo registrator).  However, the
 ConnectionManager threads sometimes need to serialise/deserialise objects
 to satisfy “getBlock” requests when the objects haven’t previously been
 serialised.  As the ConnectionManager threads don’t have the application
 jar available from their class loader, the lookup of the custom Kryo
 registrator fails.  Spark then swallows this exception, which results in a
 different ID-number-to-class mapping for this Kryo instance, and this then
 causes deserialisation errors later on a different node.
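
(For illustration only, and not Spark's actual code: the failure mode
described above amounts to resolving the registrator class through the
current thread's class loader, roughly as in the hypothetical sketch below.
On a thread whose loader does not include the application jar, the
Class.forName call throws ClassNotFoundException.)

    import org.apache.spark.serializer.KryoRegistrator

    val registratorClassName = "com.example.MyRegistrator"  // hypothetical name
    // Resolve the class via the current thread's context class loader.
    val loader = Thread.currentThread.getContextClassLoader
    val registrator = Class.forName(registratorClassName, true, loader)
      .newInstance()
      .asInstanceOf[KryoRegistrator]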

 An issue related to the one reported in SPARK-2878 is that Spark probably
 shouldn’t swallow the ClassNotFoundException for custom Kryo registrators.
  The user has explicitly specified this class, and if it deterministically
 can’t be found, then it 

Re: fair scheduler

2014-08-12 Thread fireflyc
@Crystal
You can use Spark on YARN. YARN has a fair scheduler; you can enable it by
modifying yarn-site.xml.

Sent from my iPad

 On Aug 11, 2014, at 6:49, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Hi Crystal,
 
 The fair scheduler is only for jobs running concurrently within the same 
 SparkContext (i.e. within an application), not for separate applications on 
 the standalone cluster manager. It has no effect there. To run more of those 
 concurrently, you need to set a cap on how many cores they each grab with 
 spark.cores.max.
 
 Matei
 
 On August 10, 2014 at 12:13:08 PM, 李宜芳 (xuite...@gmail.com) wrote:
 
 Hi  
 
 I am trying to switch from FIFO to FAIR with standalone mode.  
 
 my environment:  
 hadoop 1.2.1  
 spark 0.8.0 using standalone mode
 
 and I modified the code as follows:
 
 ClusterScheduler.scala -
 System.getProperty("spark.scheduler.mode", "FAIR")

 SchedulerBuilder.scala -
 val DEFAULT_SCHEDULING_MODE = SchedulingMode.FAIR

 LocalScheduler.scala -
 System.getProperty("spark.scheduler.mode", "FAIR")

 spark-env.sh -
 export SPARK_JAVA_OPTS="-Dspark.scheduler.mode=FAIR"
 export SPARK_JAVA_OPTS="-Dspark.scheduler.mode=FAIR" ./run-example
 org.apache.spark.examples.SparkPi spark://streaming1:7077
 
 
 but it does not work.
 I want to switch from FIFO to FAIR.
 How can I do it?
 
 Regards  
 Crystal Lee  
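
A minimal sketch of the two settings discussed above, using the Spark 1.x
SparkConf API (on 0.8 the equivalent values are passed as Java system
properties, e.g. via SPARK_JAVA_OPTS). Fair scheduling only affects jobs
submitted concurrently within this one SparkContext, while spark.cores.max
caps the cores the application grabs so other applications can run
alongside it on a standalone cluster. The pool name "production" is an
assumption and would need to be defined in the fair scheduler allocation
file.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("fair-scheduling-example")
      .set("spark.scheduler.mode", "FAIR")   // fair scheduling among this app's jobs
      .set("spark.cores.max", "4")           // cap cores so other apps can run too
    val sc = new SparkContext(conf)

    // Optionally assign jobs submitted from this thread to a named pool.
    sc.setLocalProperty("spark.scheduler.pool", "production")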
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-08-12 Thread RJ Nowling
Hi all,

I wanted to follow up.

I have a prototype for an optimized version of hierarchical k-means.  I
wanted to get some feedback on my approach.

Jeremy's implementation splits the largest cluster in each round.  Is it
better to do it that way or to split each cluster in half?

Are there any open-source examples that are being widely used in
production?

Thanks!
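
For discussion, here is a minimal sketch of the split-the-largest-cluster
approach built on MLlib's existing KMeans. This is not Jeremy's gist or my
prototype, just an illustration; it returns the leaf clusters as RDDs rather
than tracking the tree, and the per-round count() calls each trigger a job:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Repeatedly bisect the largest remaining cluster with k=2 KMeans until
    // we have the requested number of leaf clusters.
    def bisectingKMeans(data: RDD[Vector],
                        numClusters: Int,
                        maxIterations: Int = 20): Seq[RDD[Vector]] = {
      var clusters = Seq(data.cache())
      while (clusters.size < numClusters) {
        val sorted = clusters.sortBy(c => -c.count())   // largest first
        val (largest, rest) = (sorted.head, sorted.tail)
        val model = KMeans.train(largest, 2, maxIterations)
        val left  = largest.filter(v => model.predict(v) == 0).cache()
        val right = largest.filter(v => model.predict(v) == 1).cache()
        clusters = rest :+ left :+ right
      }
      clusters
    }

Splitting every cluster in half each round (the alternative mentioned above)
would simply replace the largest-cluster selection with a pass over all
current leaves.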



On Fri, Jul 18, 2014 at 8:05 AM, RJ Nowling rnowl...@gmail.com wrote:

 Nice to meet you, Jeremy!

 This is great!  Hierarchical clustering was next on my list --
 currently trying to get my PR for MiniBatch KMeans accepted.

 If it's cool with you, I'll try converting your code to fit in with
 the existing MLLib code as you suggest. I also need to review the
 Decision Tree code (as suggested above) to see how much of that can be
 reused.

 Maybe I can ask you to do a code review for me when I'm done?





 On Thu, Jul 17, 2014 at 8:31 PM, Jeremy Freeman
 freeman.jer...@gmail.com wrote:
  Hi all,
 
  Cool discussion! I agree that a more standardized API for clustering, and
  easy access to underlying routines, would be useful (we've also been
  discussing this when trying to develop streaming clustering algorithms,
  similar to https://github.com/apache/spark/pull/1361)
 
  For divisive, hierarchical clustering I implemented something a while
  back; here's a gist:
 
  https://gist.github.com/freeman-lab/5947e7c53b368fe90371
 
  It does bisecting k-means clustering (with k=2), with a recursive class
 for
  keeping track of the tree. I also found this much better than
 agglomerative
  methods (for the reasons Hector points out).
 
  This needs to be cleaned up, and can surely be optimized (esp. by
 replacing
  the core KMeans step with existing MLLib code), but I can say I was
 running
  it successfully on quite large data sets.
 
  RJ, depending on where you are in your progress, I'd be happy to help
 work
  on this piece and / or have you use this as a jumping off point, if
 useful.
 
  -- Jeremy
 
 
 
  --
  View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Contributing-to-MLlib-Proposal-for-Clustering-Algorithms-tp7212p7398.html
  Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.



 --
 em rnowl...@gmail.com
 c 954.496.2314




-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Spark testsuite error for hive 0.13.

2014-08-12 Thread Zhan Zhang
Problem solved by a workaround using create database and use database.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-testsuite-error-for-hive-0-13-tp7807p7819.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-12 Thread Debasish Das
I figured out the issue: the driver memory was at 512 MB and, for our
datasets, the following code needed more memory:

// Materialize usersOut and productsOut.

usersOut.count()

productsOut.count()

Thanks.

Deb
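
(usersOut and productsOut above are internal to ALS.scala; from the
application side, materializing the trained factors looks roughly like the
hypothetical sketch below, where sc is an existing SparkContext, the tiny
ratings RDD is a placeholder, and the rank/iterations/lambda values are
arbitrary.)

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    def materializeFactors(sc: SparkContext): Unit = {
      // Placeholder data; a real job would load its own ratings.
      val ratings = sc.parallelize(Seq(
        Rating(1, 1, 5.0), Rating(1, 2, 3.0), Rating(2, 1, 4.0)))
      val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda
      // Force the factor RDDs to materialize so that memory problems surface
      // here rather than in a later stage.
      model.userFeatures.count()
      model.productFeatures.count()
    }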


On Sat, Aug 9, 2014 at 6:12 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Actually, no, it did not work fine...

 With multiple ALS iterations, I am getting the same error (with or without
 my MLlib changes):

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 206 in stage 52.0 failed 4 times, most recent
 failure: Lost task 206.3 in stage 52.0 (TID ,
 tblpmidn42adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
 scala.Tuple1 cannot be cast to scala.Product2


 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)

 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)


 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)


 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)


 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)


 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)


 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)


 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)


 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)


 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

 org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)

 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)


 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 org.apache.spark.scheduler.Task.run(Task.scala:54)


 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)


 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 java.lang.Thread.run(Thread.java:744)

 Driver stacktrace:

 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)

 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)

 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:682)

 at
 

FileNotFoundException with _temporary in the name

2014-08-12 Thread Andrew Ash
Hi Spark devs,

Several people on the mailing list have seen issues with
FileNotFoundExceptions related to _temporary in the name.  I've personally
observed this several times, as have a few of my coworkers on various Spark
clusters.

Any ideas what might be going on?

I've collected the stack traces from the various mailing list posts and put
them, with links back to the original reports, at
https://issues.apache.org/jira/browse/SPARK-2984

Thanks!
Andrew