Re: Rename filter() into keep(), remove() or take() ?
Agree that filter is perhaps unintuitive. Though the Scala collections API has filter and filterNot, which together provide context that makes it more intuitive. And yes, the change could be via added methods that don't break the existing API. Still, overall I would be -1 on this unless a significant proportion of users would find it added value. Actually, adding filterNot, while not that necessary, would make more sense in my view — Sent from Mailbox for iPhone

On Thu, Feb 27, 2014 at 3:56 PM, Bertrand Dechoux decho...@gmail.com wrote: I understand the explanation but I had to try. However, the change could be made without breaking anything, but that's another story. Regards, Bertrand Dechoux

On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath nick.pentre...@gmail.com wrote: filter comes from the Scala collection method filter. I'd say it's best to keep in line with the Scala collections API, as Spark has done with RDDs generally (map, flatMap, take etc), so that it is easier and natural for developers to apply the same thinking from Scala (parallel) collections to Spark RDDs. Plus, such an API change would be a major breaking one and IMO not a good idea at this stage.

def filter(p: (A) => Boolean): Seq[A]
Selects all elements of this sequence which satisfy a predicate. p: the predicate used to test elements. Returns a new sequence consisting of all elements of this sequence that satisfy the given predicate p. The order of the elements is preserved.
(http://www.scala-lang.org/api/2.10.3/scala/collection/Seq.html)

On Thu, Feb 27, 2014 at 2:36 PM, Bertrand Dechoux decho...@gmail.com wrote: Hi, It might seem like a trivial issue, but even though it is somehow a standard name, filter() is not really explicit about which way it works. Sure, it makes sense to provide a filter function, but what happens when it returns true? Is the current element removed or kept? It is not really obvious. Has another name already been discussed? It could be keep() or remove(). But take() could also be reused and, instead of providing a number, the filter function could be requested. Regards, Bertrand
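For reference, a minimal sketch of the semantics being discussed (plain Scala collections here, but RDD.filter behaves the same way): the element is kept when the predicate returns true, and filterNot is the mirror image.

val nums = Seq(1, 2, 3, 4, 5)
nums.filter(_ % 2 == 0)     // elements for which the predicate is true are KEPT: Seq(2, 4)
nums.filterNot(_ % 2 == 0)  // the complement: Seq(1, 3, 5)
// On an RDD the keep-when-true semantics are identical:
// sc.parallelize(nums).filter(_ % 2 == 0).collect()  // Array(2, 4)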
Re: Running actions in loops
There is #3, which is to use mapPartitions and init one jodatime object per partition, which is less overhead for large objects — Sent from Mailbox for iPhone

On Sat, Mar 8, 2014 at 2:54 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: So the whole function closure you want to apply on your RDD needs to be serializable, so that it can be serialized and sent to workers to operate on the RDD. Objects of jodatime cannot be serialized and sent, hence jodatime cannot be used directly. Two bad answers: 1. initialize jodatime for each row, complete the work, and destroy it; that way it is only initialized while the job is running and need not be sent across. 2. Write your own parser and hope the jodatime guys get their act together. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Fri, Mar 7, 2014 at 12:56 PM, Ognen Duzlevski og...@nengoiksvelzud.com wrote: Mayur, have not thought of that. Yes, I use jodatime. What is the scope that this serialization issue applies to? Only the method making a call into / using such a library? The whole class the method using such a library belongs to? Sorry if it is a dumb question :) Ognen

On 3/7/14, 1:29 PM, Mayur Rustagi wrote: Mostly the job you are executing is not serializable; this typically happens when you have a library that is not serializable. Are you using any library like jodatime etc.? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

On Thu, Mar 6, 2014 at 9:50 PM, Ognen Duzlevski og...@plainvanillagames.com wrote: It looks like the problem is in the filter task - is there anything special about filter()? I have removed the filter line from the loops just to see if things will work, and they do. Does anyone have any ideas? Thanks! Ognen

On 3/6/14, 9:39 PM, Ognen Duzlevski wrote: Hello, What is the general approach people take when trying to do analysis across multiple large files where the data to be extracted from a successive file depends on the data extracted from a previous file or set of files? For example, I have the following: a group of HDFS files, each 20+GB in size. I need to extract event1 on day 1 from the first file and extract event2 from all remaining files in a period of successive dates, then do a calculation on the two events. I then need to move on to day 2, extract event1 (with certain properties), take all following days, extract event2 and run a calculation against the previous day for all days in the period. So on and so on.
I have verified that the following (very naive) approach doesn't work:

def calcSimpleRetention(start: String, end: String, event1: String, event2: String): Map[String, List[Double]] = {
  val epd = new PipelineDate(end)
  val result = for {
    dt1 <- PipelineDate.getPeriod(new PipelineDate(start), epd)
    val f1 = sc.textFile(dt1.toJsonHdfsFileName)
    val e1 = f1.filter(_.split(",")(0).split(":")(1).replace("\"", "") == event1)
               .map(line => (line.split(",")(2).split(":")(1).replace("\"", ""), 0)).cache
    val c = e1.count.toDouble
    val intres = for {
      dt2 <- PipelineDate.getPeriod(dt1 + 1, epd)
      val f2 = sc.textFile(dt2.toJsonHdfsFileName)
      val e2 = f2.filter(_.split(",")(0).split(":")(1).replace("\"", "") == event2)
                 .map(line => (line.split(",")(2).split(":")(1).replace("\"", ""), 1))
      val e1e2 = e1.union(e2)
      val r = e1e2.groupByKey().filter(e => e._2.length > 1 && e._2.filter(_ == 0).length > 0).count.toDouble
    } yield (c / r) // get the retention rate
  } yield (dt1.toString -> intres)
  Map(result: _*)
}

I am getting the following errors:

14/03/07 03:22:25 INFO SparkContext: Starting job: count at CountActor.scala:33 14/03/07 03:22:25 INFO DAGScheduler: Got job 0 (count at CountActor.scala:33) with 140 output partitions (allowLocal=false) 14/03/07 03:22:25 INFO DAGScheduler: Final stage: Stage 0 (count at CountActor.scala:33) 14/03/07 03:22:25 INFO DAGScheduler: Parents of final stage: List() 14/03/07 03:22:25 INFO DAGScheduler: Missing parents: List() 14/03/07 03:22:25 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[3] at map at CountActor.scala:32), which has no missing parents 14/03/07 03:22:25 INFO DAGScheduler: Failed to run count at CountActor.scala:33 14/03/07 03:22:25 ERROR OneForOneStrategy: Job aborted: Task not serializable: java.io.NotSerializableException: com.github.ognenpv.pipeline.CountActor org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: com.github.ognenpv.pipeline.CountActor at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at
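A minimal sketch of the mapPartitions suggestion above (option #3), assuming an existing SparkContext `sc` as in the shell; the input path, date pattern and field position are made up:

import org.joda.time.format.DateTimeFormat

val lines = sc.textFile("hdfs:///some/logs")   // hypothetical input path
val parsed = lines.mapPartitions { iter =>
  // The formatter is built on the worker, once per partition, so the
  // non-serializable joda-time object never has to be shipped in the closure.
  val fmt = DateTimeFormat.forPattern("yyyy-MM-dd")
  iter.map(line => fmt.parseDateTime(line.split(",")(0)))
}
parsed.count()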
Re: Running Spark on a single machine
Please follow the instructions at http://spark.apache.org/docs/latest/index.html and http://spark.apache.org/docs/latest/quick-start.html to get started on a local machine. — Sent from Mailbox for iPhone

On Sun, Mar 16, 2014 at 11:39 PM, goi cto goi@gmail.com wrote: Hi, I know it is probably not the purpose of Spark, but the syntax is easy and cool... I need to run some Spark-like code in memory on a single machine. Any pointers on how to optimize it to run on only one machine? -- Eran | CTO
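For reference, a minimal sketch of running Spark entirely in-process on one machine by pointing the master at "local[*]"; the app name and input file are made up:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("LocalExample").setMaster("local[*]") // use all local cores
val sc = new SparkContext(conf)
val counts = sc.textFile("data.txt")   // hypothetical local file
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)
sc.stop()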
Re: Calling Spahk enthusiasts in Boston
I would offer to host one in Cape Town, but we're almost certainly the only Spark users in the country apart from perhaps one in Johannesburg :) — Sent from Mailbox for iPhone

On Mon, Mar 31, 2014 at 8:53 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: My fellow Bostonians and New Englanders, We cannot allow New York to beat us to having a banging Spark meetup. Respond to me (and I guess also Andy?) if you are interested. Yana, I'm not sure either what is involved in organizing, but we can figure it out. I didn't know about the meetup that never took off. Nick

On Mon, Mar 31, 2014 at 2:31 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Nicholas, I'm in Boston and would be interested in a Spark group. Not sure if you know this -- there was a meetup that never got off the ground. Anyway, I'd be +1 for attending. Not sure what is involved in organizing. Seems a shame that a city like Boston doesn't have one.

On Mon, Mar 31, 2014 at 2:02 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As in, I am interested in helping organize a Spark meetup in the Boston area.

On Mon, Mar 31, 2014 at 2:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Well, since this thread has played out as it has, lemme throw in a shout-out for Boston. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Calling-Spahk-enthusiasts-in-Boston-tp3544.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
NPE using saveAsTextFile
Hi I'm using Spark 0.9.0. When calling saveAsTextFile on a custom hadoop inputformat (loaded with newAPIHadoopRDD), I get the following error below. If I call count, I get the correct count of number of records, so the inputformat is being read correctly... the issue only appears when trying to use saveAsTextFile. If I call first() I get the correct output, also. So it doesn't appear to be anything with the data or inputformat. Any idea what the actual problem is, since this stack trace is not obvious (though it seems to be in ResultTask which ultimately causes this). Is this a known issue at all? == 14/04/08 16:00:46 ERROR OneForOneStrategy: java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202) at com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228) at com.typesafe.config.ConfigException.writeObject(ConfigException.java:58) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28) at org.apache.spark.scheduler.ResultTask$.serializeInfo(ResultTask.scala:48) at org.apache.spark.scheduler.ResultTask.writeExternal(ResultTask.scala:123) at java.io.ObjectOutputStream.writeExternalData(ObjectOutputStream.java:1443) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1414) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at
Re: NPE using saveAsTextFile
Ok I thought it may be closing over the config option. I am using config for job configuration, but extracting vals from that. So not sure why as I thought I'd avoided closing over it. Will go back to source and see where it is creeping in. On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia matei.zaha...@gmail.comwrote: I haven't seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don't actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at 11:38 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Anyone have a chance to look at this? Am I just doing something silly somewhere? If it makes any difference, I am using the elasticsearch-hadoop plugin for ESInputFormat. But as I say, I can parse the data (count, first() etc). I just can't save it as text file. On Tue, Apr 8, 2014 at 4:50 PM, Nick Pentreath nick.pentre...@gmail.comwrote: Hi I'm using Spark 0.9.0. When calling saveAsTextFile on a custom hadoop inputformat (loaded with newAPIHadoopRDD), I get the following error below. If I call count, I get the correct count of number of records, so the inputformat is being read correctly... the issue only appears when trying to use saveAsTextFile. If I call first() I get the correct output, also. So it doesn't appear to be anything with the data or inputformat. Any idea what the actual problem is, since this stack trace is not obvious (though it seems to be in ResultTask which ultimately causes this). Is this a known issue at all? == 14/04/08 16:00:46 ERROR OneForOneStrategy: java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202) at com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228) at com.typesafe.config.ConfigException.writeObject(ConfigException.java:58) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeWriteObject
Re: NPE using saveAsTextFile
There was a closure over the config object lurking around - but in any case upgrading to 1.2.0 for config did the trick as it seems to have been a bug in Typesafe config, Thanks Matei! On Thu, Apr 10, 2014 at 8:46 AM, Nick Pentreath nick.pentre...@gmail.comwrote: Ok I thought it may be closing over the config option. I am using config for job configuration, but extracting vals from that. So not sure why as I thought I'd avoided closing over it. Will go back to source and see where it is creeping in. On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia matei.zaha...@gmail.comwrote: I haven't seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don't actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at 11:38 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Anyone have a chance to look at this? Am I just doing something silly somewhere? If it makes any difference, I am using the elasticsearch-hadoop plugin for ESInputFormat. But as I say, I can parse the data (count, first() etc). I just can't save it as text file. On Tue, Apr 8, 2014 at 4:50 PM, Nick Pentreath nick.pentre...@gmail.comwrote: Hi I'm using Spark 0.9.0. When calling saveAsTextFile on a custom hadoop inputformat (loaded with newAPIHadoopRDD), I get the following error below. If I call count, I get the correct count of number of records, so the inputformat is being read correctly... the issue only appears when trying to use saveAsTextFile. If I call first() I get the correct output, also. So it doesn't appear to be anything with the data or inputformat. Any idea what the actual problem is, since this stack trace is not obvious (though it seems to be in ResultTask which ultimately causes this). Is this a known issue at all? 
== 14/04/08 16:00:46 ERROR OneForOneStrategy: java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202) at com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228) at com.typesafe.config.ConfigException.writeObject(ConfigException.java:58) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:975) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1480) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1493) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1416) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:346) at scala.collection.immutable.$colon$colon.writeObject(List.scala:379) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
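The fix being described above (pull plain values out of the Config before building the closure) looks roughly like this sketch; the key names, paths and data are made up:

import com.typesafe.config.ConfigFactory
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("config-example").setMaster("local[*]"))
val config = ConfigFactory.load()
val tag = config.getString("job.tag")          // assumed key name
val outPath = config.getString("job.output")   // assumed key name

// Only the extracted Strings are captured by the closure, so the
// non-serializable Config object never has to be shipped to the executors.
sc.parallelize(Seq("a", "b", "c"))
  .map(record => s"$tag\t$record")
  .saveAsTextFile(outPath)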
Re: StackOverflow Error when run ALS with 100 iterations
I'd also say that running for 100 iterations is a waste of resources, as ALS will typically converge pretty quickly, as in within 10-20 iterations.

On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li lixiaolima...@gmail.com wrote: Thanks a lot for your information. It really helps me.

On Tue, Apr 15, 2014 at 7:57 PM, Cheng Lian lian.cs@gmail.com wrote: Probably this JIRA issue (https://spark-project.atlassian.net/browse/SPARK-1006) solves your problem. When running with a large iteration number, the lineage DAG of ALS becomes very deep; both DAGScheduler and the Java serializer may overflow because they are implemented in a recursive way. You may resort to checkpointing as a workaround.

On Wed, Apr 16, 2014 at 5:29 AM, Xiaoli Li lixiaolima...@gmail.com wrote: Hi, I am testing ALS using 7 nodes. Each node has 4 cores and 8G memory. The ALS program cannot run even with a very small training set (about 91 lines) due to a StackOverflow error when I set the number of iterations to 100. I think the problem may be caused by the updateFeatures method, which updates the products RDD iteratively by joining the previous products RDD. I am writing a program which has a similar update process to ALS. This problem also appeared when I iterated too many times (more than 80). The iterative part of my code is as follows: solution = outlinks.join(solution).map { ... } Has anyone had a similar problem? Thanks. Xiaoli
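A minimal sketch of the checkpointing workaround, assuming an existing SparkContext `sc`; the data and update rule are made up, and only the checkpoint-every-N-iterations pattern is the point:

sc.setCheckpointDir("/tmp/spark-checkpoints")   // assumed checkpoint location

val outlinks = sc.parallelize(Seq((1L, Seq(2L, 3L)), (2L, Seq(1L)), (3L, Seq(1L)))).cache()
var solution = sc.parallelize(Seq((1L, 1.0), (2L, 1.0), (3L, 1.0)))

for (i <- 1 to 100) {
  solution = outlinks.join(solution).map { case (id, (links, rank)) => (id, rank / links.size) }
  if (i % 10 == 0) {
    solution.checkpoint() // truncates the lineage so the DAG stays shallow
    solution.count()      // an action is needed to actually write the checkpoint
  }
}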
Re: User/Product Clustering with pySpark ALS
There's no easy way to do this currently. The pieces are there from the PySpark code for regression, which should be adaptable. But you'd have to roll your own solution. This is something I also want, so I intend to put together a pull request for this soon — Sent from Mailbox

On Tue, Apr 29, 2014 at 4:28 PM, Laird, Benjamin benjamin.la...@capitalone.com wrote: Hi all - I’m using pySpark/MLLib ALS for user/item clustering and would like to directly access the user/product RDDs (called userFeatures/productFeatures in class MatrixFactorizationModel in mllib/recommendation/MatrixFactorizationModel.scala). This doesn’t seem too complex, but it doesn’t seem like the functionality is currently available. I think it requires accessing the underlying java model like so: model = ALS.train(ratings,1,iterations=1,blocks=5) userFeatures = RDD(model.javamodel.userFeatures, sc, ???) However, I don’t know what to pass as the deserializer. I need these low-dimensional vectors as an RDD to then use in KMeans clustering. Has anyone done something similar? Ben

The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
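For reference, a hedged sketch of the same thing through the Scala API, where the feature RDDs are already exposed; the ratings data is made up and an existing SparkContext `sc` is assumed (the Python route would still need the custom plumbing described above):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = ALS.train(ratings, 10, 10, 0.01)                     // rank, iterations, lambda
val userVectors = model.userFeatures.map { case (_, factors) => Vectors.dense(factors) }
val clusters = KMeans.train(userVectors, 2, 20)                  // k, maxIterations
clusters.clusterCenters.foreach(println)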
spark-submit / S3
Hi, I see from the docs for 1.0.0 that the new spark-submit mechanism seems to support specifying the jar with hdfs:// or http:// Does this support S3? It doesn't seem to (I have tried it on EC2 and it doesn't work):

./bin/spark-submit --master local[2] --class myclass s3n://bucket/myapp.jar args
Re: Spark on HBase vs. Spark on HDFS
Hi In my opinion, running HBase for immutable data is generally overkill in particular if you are using Shark anyway to cache and analyse the data and provide the speed. HBase is designed for random-access data patterns and high throughput R/W activities. If you are only ever writing immutable logs, then that is what HDFS is designed for. Having said that, if you replace HBase you will need to come up with a reliable way to put data into HDFS (a log aggregator like Flume or message bus like Kafka perhaps, etc), so the pain of doing that may not be worth it given you already know HBase. On Thu, May 22, 2014 at 9:33 AM, Limbeck, Philip philip.limb...@automic.com wrote: HI! We are currently using HBase as our primary data store of different event-like data. On-top of that, we use Shark to aggregate this data and keep it in memory for fast data access. Since we use no specific HBase functionality whatsoever except Putting data into it, a discussion came up on having to set up an additional set of components on top of HDFS instead of just writing to HDFS directly. Is there any overview regarding implications of doing that ? I mean except things like taking care of file structure and the like. What is the true advantage of Spark on HBase in favor of Spark on HDFS? Best Philip Automic Software GmbH, Hauptstrasse 3C, 3012 Wolfsgraben Firmenbuchnummer/Commercial Register No. 275184h Firmenbuchgericht/Commercial Register Court: Landesgericht St. Poelten This email (including any attachments) may contain information which is privileged, confidential, or protected. If you are not the intended recipient, note that any disclosure, copying, distribution, or use of the contents of this message and attached files is prohibited. If you have received this email in error, please notify the sender and delete this email and any attached files.
Re: Writing RDDs from Python Spark progrma (pyspark) to HBase
It's not possible currently to write anything other than text (or pickle files I think in 1.0.0 or if not then in 1.0.1) from PySpark. I have an outstanding pull request to add READING any InputFormat from PySpark, and after that is in I will look into OutputFormat too. What does your data look like? Any details about your use case that you could share would aid the design of this feature. N On Wed, May 28, 2014 at 3:00 PM, gaurav.dasgupta gaurav.d...@gmail.comwrote: Hi, I am unable to understand how to write data directly on HBase table from a Spark (pyspark) Python program. Is this possible in the current Spark releases? If so, can someone provide an example code snippet to do this? Thanks in advance. Regards, Gaurav -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Writing-RDDs-from-Python-Spark-progrma-pyspark-to-HBase-tp6469.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Python, Spark and HBase
Hi Tommer, I'm working on updating and improving the PR, and will work on getting an HBase example working with it. Will feed back as soon as I have had the chance to work on this a bit more. N

On Thu, May 29, 2014 at 3:27 AM, twizansk twiza...@gmail.com wrote: The code which causes the error is:

sc = SparkContext("local", "My App")
rdd = sc.newAPIHadoopFile(
    name,
    'org.apache.hadoop.hbase.mapreduce.TableInputFormat',
    'org.apache.hadoop.hbase.io.ImmutableBytesWritable',
    'org.apache.hadoop.hbase.client.Result',
    conf={"hbase.zookeeper.quorum": "my-host",
          "hbase.rootdir": "hdfs://my-host:8020/hbase",
          "hbase.mapreduce.inputtable": "data"})

The full stack trace is:

Py4JError Traceback (most recent call last) ipython-input-8-3b9a4ea2f659 in module() 7 conf={"hbase.zookeeper.quorum": "my-host", 8 "hbase.rootdir": "hdfs://my-host:8020/hbase", 9 "hbase.mapreduce.inputtable": "data"}) 10 11 /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.pyc in newAPIHadoopFile(self, name, inputformat_class, key_class, value_class, key_wrapper, value_wrapper, conf) 281 for k, v in conf.iteritems(): 282 jconf[k] = v --> 283 jrdd = self._jvm.PythonRDD.newAPIHadoopFile(self._jsc, name, inputformat_class, key_class, value_class, 284 key_wrapper, value_wrapper, jconf) 285 return RDD(jrdd, self, PickleSerializer()) /opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 657 else: 658 raise Py4JError('{0} does not exist in the JVM'. --> 659 format(self._fqn + name)) 660 661 def __call__(self, *args): Py4JError: org.apache.spark.api.python.PythonRDDnewAPIHadoopFile does not exist in the JVM

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6507.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Can't seem to link external/twitter classes from my own app
@Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver class path:

bin/spark-submit --driver-class-path SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar

Seems redundant to me, since I thought that the JAR given as argument is copied to the driver and made available. But this solved it for me so perhaps give it a try?

On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10

On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I finally got a Hello World app working that I wrote myself. Now I'm trying to make a minimal streaming app based on the twitter examples, (running standalone right now while learning) and when running it like this:

bin/spark-submit --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar

I'm getting this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$

Which I'm guessing is because I haven't put in a dependency on external/twitter in the .sbt, but _how_? I can't find any docs on it. Here's my build file so far:

simple.sbt
--
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.0.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
--

I've tried a few obvious things like adding:
libraryDependencies += "org.apache.spark" %% "spark-external" % "1.0.0"
libraryDependencies += "org.apache.spark" %% "spark-external-twitter" % "1.0.0"
because, well, that would match the naming scheme implied so far, but it errors.

Also, I just realized I don't completely understand if: (a) the spark-submit command _sends_ the .jar to all the workers, or (b) the spark-submit command sends a _job_ to the workers, which are supposed to already have the jar file installed (or in hdfs), or (c) the Context is supposed to list the jars to be distributed. (Is that deprecated?) One part of the documentation says: "Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar." but another says: "application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." I suppose both could be correct if you take a certain point of view. -- Jeremy Lee BCompSci(Hons) The Unorthodox Engineers
Re: Can't seem to link external/twitter classes from my own app
The magic incantation is sbt assembly (not assemble). Actually I find maven with their assembly plugins to be very easy (mvn package). I can send a pom.xml for a skeleton project if you need — Sent from Mailbox

On Thu, Jun 5, 2014 at 6:59 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Hmm.. That's not working so well for me. First, I needed to add a project/plugin.sbt file with the contents:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")

Before 'sbt/sbt assemble' worked at all. And I'm not sure about that version number, but 0.9.1 isn't working much better and 0.11.4 is the latest one recommended by the sbt project site. Where did you get your version from? Second, even when I do get it to build a .jar, spark-submit is still telling me the external.twitter library is missing. I tried using your github project as-is, but it also complained about the missing plugin.. I'm trying it with various versions now to see if I can get that working, even though I don't know anything about kafka. Hmm, and no. Here's what I get:

[info] Set current project to Simple Project (in build file:/home/ubuntu/spark-1.0.0/SparkKafka/)
[error] Not a valid command: assemble
[error] Not a valid project ID: assemble
[error] Expected ':' (if selecting a configuration)
[error] Not a valid key: assemble (similar: assembly, assemblyJarName, assemblyDirectory)
[error] assemble
[error]

I also found this project which seemed to be exactly what I was after: https://github.com/prabeesh/SparkTwitterAnalysis ...but it was for Spark 0.9, and though I updated all the version references to 1.0.0, that one doesn't work either. I can't even get it to build. *sigh* Is it going to be easier to just copy the external/ source code into my own project? Because I will... especially if creating Uberjars takes this long every... single... time...

On Thu, Jun 5, 2014 at 8:52 AM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Thanks Patrick! Uberjars. Cool. I'd actually heard of them. And thanks for the link to the example! I shall work through that today. I'm still learning sbt and its many options... the last new framework I learned was node.js, and I think I've been rather spoiled by npm. At least it's not maven. Please, oh please don't make me learn maven too. (The only people who seem to like it have Software Stockholm Syndrome: I know maven kidnapped me and beat me up, but if you spend long enough with it, you eventually start to sympathize and see its point of view.)

On Thu, Jun 5, 2014 at 3:39 AM, Patrick Wendell pwend...@gmail.com wrote: Hey Jeremy, The issue is that you are using one of the external libraries and these aren't actually packaged with Spark on the cluster, so you need to create an uber jar that includes them. You can look at the example here (I recently did this for a kafka project and the idea is the same): https://github.com/pwendell/kafka-spark-example You'll want to make an uber jar that includes these packages (run sbt assembly) and then submit that jar to spark-submit. Also, I'd try running it locally first (if you aren't already) just to make the debugging simpler. - Patrick

On Wed, Jun 4, 2014 at 6:16 AM, Sean Owen so...@cloudera.com wrote: Ah sorry, this may be the thing I learned for the day. The issue is that classes from that particular artifact are missing though. Worth interrogating the resulting .jar file with jar tf to see if it made it in?
On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: @Sean, the %% syntax in SBT should automatically add the Scala major version qualifier (_2.10, _2.11 etc) for you, so that does appear to be correct syntax for the build. I seemed to run into this issue with some missing Jackson deps, and solved it by including the jar explicitly on the driver class path: bin/spark-submit --driver-class-path SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar Seems redundant to me since I thought that the JAR as argument is copied to driver and made available. But this solved it for me so perhaps give it a try? On Wed, Jun 4, 2014 at 3:01 PM, Sean Owen so...@cloudera.com wrote: Those aren't the names of the artifacts: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22 The name is spark-streaming-twitter_2.10 On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: Man, this has been hard going. Six days, and I finally got a Hello World App working that I wrote myself. Now I'm trying to make a minimal streaming app based on the twitter examples, (running standalone right now while learning) and when running it like this: bin/spark-submit --class SimpleApp SimpleApp/target/scala-2.10/simple-project_2.10-1.0.jar I'm getting this error
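Pulling the thread's advice together, a hedged sketch of an uber-jar build for this example; the file layout, plugin version and the AssemblyKeys import reflect how sbt-assembly 0.11.x was typically wired up and are illustrative rather than taken from the thread:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.4")

// build.sbt
import AssemblyKeys._

assemblySettings

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "1.0.0" % "provided",  // already on the cluster
  "org.apache.spark" %% "spark-streaming"         % "1.0.0" % "provided",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0",               // bundled into the fat jar
  "org.twitter4j"    %  "twitter4j-stream"        % "3.0.3"
)

// then run `sbt assembly` and pass the resulting fat jar to spark-submit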
Re: Cassandra examples don't work for me
You need Cassandra 1.2.6 for the Spark examples — Sent from Mailbox

On Thu, Jun 5, 2014 at 12:02 AM, Tim Kellogg t...@2lemetry.com wrote: Hi, I’m following the directions to run the cassandra example “org.apache.spark.examples.CassandraTest” and I get this error:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected at org.apache.cassandra.hadoop.AbstractColumnFamilyInputFormat.getSplits(AbstractColumnFamilyInputFormat.java:113) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:90) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:59) at org.apache.spark.rdd.PairRDDFunctions.reduceByKey(PairRDDFunctions.scala:370) at org.apache.spark.examples.CassandraTest$.main(CassandraTest.scala:100) at org.apache.spark.examples.CassandraTest.main(CassandraTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I’m running Cassandra version 2.0.6, and this comes from the spark-1.0.0-bin-hadoop2 distribution package. I am running the example with this commandline: bin/run-example org.apache.spark.examples.CassandraTest localhost localhost 9160 I suspect it’s because I’m running the wrong version of Cassandra, but I can’t find the correct version listed anywhere. I hope this is an easy issue to address. Much thanks, Tim
Re: compress in-memory cache?
Have you set the persistence level of the RDD to MEMORY_ONLY_SER (http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)? If you're calling cache, the default persistence level is MEMORY_ONLY, so that setting will have no impact.

On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen xche...@gmail.com wrote: I have a working set larger than available memory, thus I am hoping to turn on rdd compression so that I can store more in memory. Strangely, it made no difference. The number of cached partitions, fraction cached, and size in memory remain the same. Any ideas? I confirmed that rdd compression wasn't on before and it was on for the second test.

scala> sc.getConf.getAll foreach println
...
(spark.rdd.compress,true)
...

I haven't tried lzo vs snappy, but my guess is that either one should provide at least some benefit.. Thanks. -Simon
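A minimal sketch of the suggestion above: spark.rdd.compress only applies to blocks stored in serialized form, so the RDD has to be persisted with MEMORY_ONLY_SER rather than the default cache()/MEMORY_ONLY. The input path is made up and an existing SparkContext `sc` is assumed:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///big/dataset")      // hypothetical input
  .map(_.toUpperCase)
  .persist(StorageLevel.MEMORY_ONLY_SER)          // serialized in memory, so compression can apply
rdd.count()                                       // materialize into the cache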
Re: error loading large files in PySpark 0.9.0
Ah, looking at that input format it should just work out of the box using sc.newAPIHadoopFile ... Would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N — Sent from Mailbox

On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat support. We recently wrote a custom hadoop input format so we can support flat binary files (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop), and have been testing it in Scala. So I was following Nick's progress and was eager to check this out when ready. Will let you guys know how it goes. -- J -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049p7144.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Are scala.MatchError messages a problem?
When you use match, the match must be exhaustive; a MatchError is thrown if the match fails. That's why you usually handle the default case using case _ => ... Here it looks like you're taking the text of all statuses, which means not all of them will be commands, which means your match will not be exhaustive. The solution is either to add a default case which does nothing, or (probably better) to add a .filter such that you filter out anything that's not a command before matching. Just looking at it again, it could also be that you take x => x._2._1 ... What type is that? Should it not be a Seq if you're joining, in which case the match will also fail... Hope this helps. — Sent from Mailbox

On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee unorthodox.engine...@gmail.com wrote: I shut down my first (working) cluster and brought up a fresh one... and it's been a bit of a horror and I need to sleep now. Should I be worried about these errors? Or did I just have the old log4j.config tuned so I didn't see them?

14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming job 1402245172000 ms.2 scala.MatchError: 0101-01-10 (of class java.lang.String) at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218) at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217) at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744)

The error comes from this code, which seemed like a sensible way to match things (the case cmd_plus(w) statement is generating the error):

val cmd_plus = "[+]([\\w]+)".r
val cmd_minus = "[-]([\\w]+)".r
// find command user tweets
val commands = stream.map( status => ( status.getUser().getId(), status.getText() ) )
  .foreachRDD(rdd => {
    rdd.join(superusers).map( x => x._2._1 ).collect().foreach { cmd => {
218:  cmd match {
        case cmd_plus(w) => { ... }
        case cmd_minus(w) => { ... }
      }
    }}
  })

It seems a bit excessive for scala to throw exceptions because a regex didn't match. Something feels wrong.
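A minimal sketch of the suggested fix on plain strings (the sample values are made up): a default case makes the match exhaustive, so non-command text is ignored instead of raising scala.MatchError.

val cmd_plus = "[+](\\w+)".r
val cmd_minus = "[-](\\w+)".r
val tweets = Seq("+follow", "-mute", "0101-01-10")   // made-up sample values

tweets.foreach {
  case cmd_plus(w)  => println(s"add: $w")
  case cmd_minus(w) => println(s"remove: $w")
  case _            => // default case: anything that is not a command is ignored
}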
Re: mllib, python and SVD
Don't think SVD is exposed via MLlib in Python yet, but you can also check out: https://github.com/ogrisel/spylearn where Jeremy Freeman put together a numpy-based SVD algorithm (this is a bit outdated but should still work I assume) (also https://github.com/freeman-lab/thunder has a PCA implementation). On Mon, Jun 9, 2014 at 11:32 AM, Håvard Wahl Kongsgård haavard.kongsga...@gmail.com wrote: Hi, is it possible to do Singular value decomposition (SVD) with python in spark(1.0.0)? -Havard WK
Re: Optimizing reduce for 'huge' aggregated outputs.
Can you key your RDD by some key and use reduceByKey? In fact, if you are merging a bunch of maps, you can create a set of (k, v) pairs in your mapPartitions and then reduceByKey using some merge function. The reduce will happen in parallel on multiple nodes in this case. You'll end up with just a single set of (k, v) per partition which you can reduce or collect and merge on the driver. — Sent from Mailbox

On Tue, Jun 10, 2014 at 1:05 AM, Sung Hwan Chung coded...@cs.stanford.edu wrote: I suppose what I want is the memory efficiency of toLocalIterator and the speed of collect. Is there any such thing?

On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung coded...@cs.stanford.edu wrote: Hello, I noticed that the final reduce function happens in the driver node with code that looks like the following:

val outputMap = mapPartitions(doSomething).reduce((a: Map, b: Map) => a.merge(b))

although individual outputs from mappers are small. Over time the aggregated result outputMap could be huuuge (say with hundreds of millions of keys and values, reaching gigabytes). I noticed that, even if we have a lot of memory in the driver node, this process becomes really slow eventually (say we have 100+ partitions: the first reduce is fast, but progressively it becomes veeery slow as more and more partition outputs get aggregated). Is this because the intermediate reduce output gets serialized and then deserialized every time? What I'd like ideally is, since the reduce is taking place on the same machine anyway, that there's no need for any serialization and deserialization, and the incoming results are just aggregated into the final aggregation. Is this possible?
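A minimal sketch of the reduceByKey suggestion, assuming an existing SparkContext `sc`; the data is made up. Each partition emits its small local map as (key, value) pairs, the merging runs on the executors, and the driver only receives the final totals:

val data = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)

val pairs = data.mapPartitions { iter =>
  val local = scala.collection.mutable.Map[String, Long]().withDefaultValue(0L)
  iter.foreach(k => local(k) += 1)      // one small map per partition
  local.iterator                        // emitted as (key, count) pairs
}
val merged = pairs.reduceByKey(_ + _)   // merging happens in parallel on the cluster
val result = merged.collectAsMap()      // only (key, total) pairs reach the driver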
RE: Question about RDD cache, unpersist, materialization
If you want to force materialization, use .count(). Also, if you can, simply don't unpersist anything unless you really need to free the memory. — Sent from Mailbox

On Wed, Jun 11, 2014 at 5:13 AM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: BTW, it is possible that rdd.first() does not compute all the partitions. So first() cannot be used for the situation below.

-----Original Message----- From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr] Sent: Wednesday, June 11, 2014 11:40 AM To: user@spark.apache.org Subject: Question about RDD cache, unpersist, materialization

Hi, What I (seem to) know about the RDD persisting API is as follows:
- cache() and persist() are not actions. They only do a marking.
- unpersist() is also not an action. It only removes a marking. But if the rdd is already in memory, it is unloaded.
And there seems to be no API to forcefully materialize the RDD without requesting data via an action method, for example first().

So, I am faced with the following scenario:

{
  JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>());  // create empty for merging
  for (int i = 0; i < 10; i++) {
    JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
    rdd.cache();  // Since it will be used twice, cache.
    rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]);  // Transform and save, rdd materializes
    rddUnion = rddUnion.union(rdd.map(...).filter(...));  // Do another transform to T and merge by union
    rdd.unpersist();  // Now it seems not needed. (But needed actually)
  }
  // Here, rddUnion actually materializes, and needs all 10 rdds that were already unpersisted.
  // So, rebuilding all 10 rdds will occur.
  rddUnion.saveAsTextFile(mergedFileName);
}

If rddUnion could be materialized before the rdd.unpersist() line and cache()d, the rdds in the loop would not be needed on rddUnion.saveAsTextFile(). Now what is the best strategy?
- Do not unpersist all 10 rdds in the loop.
- Materialize rddUnion in the loop by calling a 'light' action API, like first().
- Give up and just rebuild/reload all 10 rdds when saving rddUnion.
Is there some misunderstanding? Thanks.
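A minimal sketch of the .count() suggestion in Scala (the Java version is analogous); the data and output path are made up, and an existing SparkContext `sc` is assumed:

val rdd = sc.parallelize(1 to 1000).cache()
val transformed = rdd.map(_ * 2).cache()
transformed.count()                         // action over every partition, so `transformed` is now fully cached
rdd.unpersist()                             // safe: later work reads `transformed` from the cache, not `rdd`
transformed.saveAsTextFile("/tmp/union-out")  // hypothetical output path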
Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.
can you not use a Cassandra OutputFormat? Seems they have BulkOutputFormat. An example of using it with Hadoop is here: http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html Using it with Spark will be similar to the examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala and https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraCQLTest.scala

On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, (My excuses for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds, in order to parallelize the process as much as possible (and probably even stream it to the Cassandra ring as well). After the initial hurdles with the CQLSSTableWriter (like requiring the yaml file), I'm confronted now with this issue:

java.lang.RuntimeException: Attempting to load already loaded column family customer.rawts at org.apache.cassandra.config.Schema.load(Schema.java:347) at org.apache.cassandra.config.Schema.load(Schema.java:112) at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336)

I'm creating a writer on each parallel partition like this:

def store(rdd: RDD[Message]) = {
  rdd.foreachPartition( msgIterator => {
    val writer = CQLSSTableWriter.builder()
      .inDirectory("/tmp/cass")
      .forTable(schema)
      .using(insertSttmt).build()
    msgIterator.foreach(msg => {...})
  })
}

And if I'm reading the exception correctly, I can only create one writer per table in one JVM. Digging a bit further in the code, it looks like the Schema.load(...) singleton enforces that limitation. I guess writes to the writer will not be thread-safe, and even if they were, the contention that multiple threads will create by having all parallel tasks trying to dump a few GB of data to disk at the same time will defeat the purpose of using the SSTables for bulk upload anyway. So, are there ways to use the CQLSSTableWriter concurrently? If not, what is the next best option to load batch data at high throughput into Cassandra? Will the upcoming Spark-Cassandra integration help with this? (i.e. should I just sit back, relax and the problem will solve itself?) Thanks, Gerard.
Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.
Right, ok. I can't say I've used the Cassandra OutputFormats before. But perhaps if you use it directly (instead of via Calliope) you may be able to get it to work, albeit with less concise code? Or perhaps you may be able to build Cassandra from source with Hadoop 2 / CDH4 support: https://groups.google.com/forum/#!topic/nosql-databases/Y-9amAdZk1s On Wed, Jun 25, 2014 at 9:14 PM, Gerard Maas gerard.m...@gmail.com wrote: Thanks Nick. We used the CassandraOutputFormat through Calliope. The Calliope API makes the CassandraOutputFormat quite accessible and is cool to work with. It worked fine at prototype level, but we had Hadoop version conflicts when we put it in our Spark environment (Using our Spark assembly compiled with CDH4.4). The conflict seems to be at the Cassandra-all lib level, which is compiled against a different hadoop version (v1). We could not get round that issue. (Any pointers in that direction?) That's why I'm trying the direct CQLSSTableWriter way but it looks blocked as well. -kr, Gerard. On Wed, Jun 25, 2014 at 8:57 PM, Nick Pentreath nick.pentre...@gmail.com wrote: can you not use a Cassandra OutputFormat? Seems they have BulkOutputFormat. An example of using it with Hadoop is here: http://shareitexploreit.blogspot.com/2012/03/bulkloadto-cassandra-with-hadoop.html Using it with Spark will be similar to the examples: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraTest.scala and https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/CassandraCQLTest.scala On Wed, Jun 25, 2014 at 8:44 PM, Gerard Maas gerard.m...@gmail.com wrote: Hi, (My excuses for the cross-post from SO) I'm trying to create Cassandra SSTables from the results of a batch computation in Spark. Ideally, each partition should create the SSTable for the data it holds in order to parallelize the process as much as possible (and probably even stream it to the Cassandra ring as well) After the initial hurdles with the CQLSSTableWriter (like requiring the yaml file), I'm confronted now with this issue: java.lang.RuntimeException: Attempting to load already loaded column family customer.rawts at org.apache.cassandra.config.Schema.load(Schema.java:347) at org.apache.cassandra.config.Schema.load(Schema.java:112) at org.apache.cassandra.io.sstable.CQLSSTableWriter$Builder.forTable(CQLSSTableWriter.java:336) I'm creating a writer on each parallel partition like this: def store(rdd:RDD[Message]) = { rdd.foreachPartition( msgIterator = { val writer = CQLSSTableWriter.builder() .inDirectory(/tmp/cass) .forTable(schema) .using(insertSttmt).build() msgIterator.foreach(msg = {...}) })} And if I'm reading the exception correctly, I can only create one writer per table in one JVM. Digging a bit further in the code, it looks like the Schema.load(...) singleton enforces that limitation. I guess writings to the writer will not be thread-safe and even if they were the contention that multiple threads will create by having all parallel tasks trying to dump few GB of data to disk at the same time will defeat the purpose of using the SSTables for bulk upload anyway. So, are there ways to use the CQLSSTableWriter concurrently? If not, what is the next best option to load batch data at high throughput in Cassandra? Will the upcoming Spark-Cassandra integration help with this? (ie. should I just sit back, relax and the problem will solve itself?) Thanks, Gerard.
Re: ElasticSearch enrich
You can just add elasticsearch-hadoop as a dependency to your project to use the ESInputFormat and ESOutputFormat ( https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics here: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html For testing, yes I think you will need to start ES in local mode (just ./bin/elasticsearch) and use the default config (host = localhost, port = 9200). On Thu, Jun 26, 2014 at 9:04 AM, boci boci.b...@gmail.com wrote: That's okay, but hadoop has ES integration. What happens if I run saveAsHadoopFile without hadoop (or must I pull up hadoop programmatically? (if I can)) b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Thu, Jun 26, 2014 at 1:20 AM, Holden Karau hol...@pigscanfly.ca wrote: On Wed, Jun 25, 2014 at 4:16 PM, boci boci.b...@gmail.com wrote: Hi guys, thanks for the direction. Now I have some problems/questions: - in local (test) mode I want to use ElasticClient.local to create the es connection, but in production I want to use ElasticClient.remote; to do this I want to pass the ElasticClient to mapPartitions, or what is the best practice? In this case you probably want to make the ElasticClient inside of mapPartitions (since it isn't serializable) and if you want to use a different client in local mode just have a flag that controls what type of client you create. - my stream output is written into elasticsearch. How can I test output.saveAsHadoopFile[ESOutputFormat](-) in a local environment? - After storing the enriched data into ES, I want to generate aggregated data (EsInputFormat). How can I test it locally? I think the simplest thing to do would be to use the same client in both modes and just start a single node elasticsearch cluster. Thanks guys b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Wed, Jun 25, 2014 at 1:33 AM, Holden Karau hol...@pigscanfly.ca wrote: So I'm giving a talk at the Spark summit on using Spark and ElasticSearch, but for now if you want to see a simple demo which uses elasticsearch for geo input you can take a look at my quick and dirty implementation with TopTweetsInALocation ( https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/TopTweetsInALocation.scala ). This approach uses the ESInputFormat which avoids the difficulty of having to manually create ElasticSearch clients. This approach might not work for your data, e.g. if you need to create a query for each record in your RDD. If this is the case, you could instead look at using mapPartitions and setting up your Elasticsearch connection inside of that, so you could then re-use the client for all of the queries on each partition. This approach will avoid having to serialize the Elasticsearch connection because it will be local to your function. Hope this helps! Cheers, Holden :) On Tue, Jun 24, 2014 at 4:28 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: It's not used as the default serializer due to some issues with the compatibility requirement to register the classes.. Which part are you getting as nonserializable... you need to serialize that class if you are sending it to spark workers inside a map, reduce, mappartition or any of the operations on RDD. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jun 25, 2014 at 4:52 AM, Peng Cheng pc...@uow.edu.au wrote: I'm afraid persisting a connection across two tasks is a dangerous act, as they can't be guaranteed to be executed on the same machine.
Your ES server may think it's a man-in-the-middle attack! I think it's possible to invoke a static method that gives you a connection in a local 'pool', so nothing will sneak into your closure, but it's too complex and there should be a better option. Never used kryo before; if it's that good perhaps we should use it as the default serializer -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ElasticSearch-enrich-tp8209p8222.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Cell : 425-233-8271 -- Cell : 425-233-8271
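A minimal sketch of the mapPartitions pattern Holden and Nick describe above: the non-serializable client is created on the executor, inside the partition function, so it is never shipped in the closure, and it is reused for every record in that partition. The EsClient class below is only a stand-in so the sketch compiles on its own; in real code the thread's ElasticClient.local / ElasticClient.remote(host, port) (or whatever client API you use) would be constructed at that point instead.

    import org.apache.spark.rdd.RDD

    // Stand-in for the real, non-serializable ES client; defined here only
    // so the sketch is self-contained. Replace with your actual client.
    class EsClient(host: String, port: Int) {
      def enrich(raw: String): String = raw // real code would query ES here
      def close(): Unit = ()
    }

    def enrichWithEs(events: RDD[String], host: String, port: Int): RDD[String] =
      events.mapPartitions { iter =>
        // Built here, on the executor, so it never needs to be serialized;
        // only the host/port strings are captured in the closure.
        val client = new EsClient(host, port)
        val out = iter.map(client.enrich).toList // materialise so the client can be closed
        client.close()
        out.iterator
      }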
Re: Sample datasets for MLlib and Graphx
Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions For SVM there are a couple of ad click prediction datasets of pretty large size. For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ — Sent from Mailbox On Thu, Jul 3, 2014 at 3:25 PM, AlexanderRiggers alexander.rigg...@gmail.com wrote: Hello! I want to play around with several different cluster settings and measure performance for MLlib and GraphX, and was wondering if anybody here could hit me up with datasets for these applications from 5GB onwards? I'm mostly interested in SVM and Triangle Count, but would be glad for any help. Best regards, Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Sample datasets for MLlib and Graphx
The Kaggle data is not in libsvm format so you'd have to do some transformation. The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad prediction data is around 2-3GB compressed I think. To my knowledge these are the largest binary classification datasets I've come across which are easily publicly available (very happy to be proved wrong about this though :) — Sent from Mailbox On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers alexander.rigg...@gmail.com wrote: Nick Pentreath wrote Take a look at Kaggle competition datasets - https://www.kaggle.com/competitions I was looking for files in LIBSVM format and never found anything of a bigger size on Kaggle. Most competitions I've seen need data processing and feature generation, but maybe I have to take a second look. Nick Pentreath wrote For graph stuff, SNAP has large network data: https://snap.stanford.edu/data/ Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760p8762.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
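For reference, once a dataset has been converted to LIBSVM format, loading it for the kind of SVM experiments discussed above is a one-liner with MLUtils; the path and iteration count below are placeholders, and the raw-CSV-to-libsvm conversion itself is a separate step.

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.classification.SVMWithSGD

    // Hypothetical path; point it at your converted Criteo/KDD data.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/clicks.libsvm").cache()
    val model = SVMWithSGD.train(data, 100) // 100 iterations, default step size / regularization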
Re: DynamoDB input source
You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf . Unfortunately, Google is not coming to my rescue on finding further mention for this support. Any pointers would be well received. Big thanks, ian
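The general mechanics of "create a HadoopRDD from an InputFormat" look like the sketch below. TextInputFormat stands in for DynamoDBInputFormat purely so the example runs as-is; the DynamoDB connector's actual class names and configuration keys live in the AWS/EMR jars and are not reproduced here.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Swap in the DynamoDB InputFormat and its key/value Writable classes,
    // and set the connector's table/region properties on the Configuration.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///some/input/path",   // placeholder; a DynamoDB format is configured via properties instead
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      sc.hadoopConfiguration)

    rdd.map(_._2.toString).take(5).foreach(println)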
Re: DynamoDB input source
No boto support for that. In master there is Python support for loading Hadoop InputFormats; not sure if it will be in 1.0.1 or 1.1. In the master docs, under the programming guide, there are instructions, and under the examples project there are pyspark examples of using Cassandra and HBase. These should hopefully give you enough to get started. Depending on how easy it is to use the DynamoDB format, you may have to write a custom converter (see the mentioned examples for more details). Sent from my iPhone On 4 Jul 2014, at 08:38, Ian Wilkinson ia...@me.com wrote: Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote: You should be able to use DynamoDBInputFormat (I think this should be part of the AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention for this support. Any pointers would be well received. Big thanks, ian
Re: DynamoDB input source
I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection but this involves figuring out how to split the data in dynamo - inputFormat takes care of this so should be the easier approach — Sent from Mailbox On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson ia...@me.com wrote: Excellent. Let me get browsing on this. Huge thanks, ian On 4 Jul 2014, at 16:47, Nick Pentreath nick.pentre...@gmail.com wrote: No boto support for that. In master there is Python support for loading Hadoop inputFormat. Not sure if it will be in 1.0.1 or 1.1 I master docs under the programming guide are instructions and also under examples project there are pyspark examples of using Cassandra and HBase. These should hopefully give you enough to get started. Depending on how easy it is to use the dynamo DB format, you may have to write a custom converter (see the mentioned examples for storm details). Sent from my iPhone On 4 Jul 2014, at 08:38, Ian Wilkinson ia...@me.com wrote: Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote: You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf. Unfortunately, Google is not coming to my rescue on finding further mention for this support. Any pointers would be well received. Big thanks, ian
Re: DynamoDB input source
Interesting - I would have thought they would make that available publicly. Unfortunately, unless you can use Spark on EMR, I guess your options are to hack it by spinning up an EMR cluster and getting the JAR, or maybe fall back to using boto and rolling your own :( On Fri, Jul 4, 2014 at 9:28 AM, Ian Wilkinson ia...@me.com wrote: Trying to discover source for the DynamoDBInputFormat. Not appearing in: - https://github.com/aws/aws-sdk-java - https://github.com/apache/hive Then came across http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb . Unsure whether this represents the latest situation… ian On 4 Jul 2014, at 16:58, Nick Pentreath nick.pentre...@gmail.com wrote: I should qualify by saying there is boto support for dynamodb - but not for the inputFormat. You could roll your own python-based connection but this involves figuring out how to split the data in dynamo - inputFormat takes care of this so should be the easier approach — Sent from Mailbox https://www.dropbox.com/mailbox On Fri, Jul 4, 2014 at 8:51 AM, Ian Wilkinson ia...@me.com wrote: Excellent. Let me get browsing on this. Huge thanks, ian On 4 Jul 2014, at 16:47, Nick Pentreath nick.pentre...@gmail.com wrote: No boto support for that. In master there is Python support for loading Hadoop inputFormat. Not sure if it will be in 1.0.1 or 1.1 I master docs under the programming guide are instructions and also under examples project there are pyspark examples of using Cassandra and HBase. These should hopefully give you enough to get started. Depending on how easy it is to use the dynamo DB format, you may have to write a custom converter (see the mentioned examples for storm details). Sent from my iPhone On 4 Jul 2014, at 08:38, Ian Wilkinson ia...@me.com wrote: Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote: You should be able to use DynamoDBInputFormat (I think this should be part of AWS libraries for Java) and create a HadoopRDD from that. On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote: Hi, I noticed mention of DynamoDB as input source in http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf . Unfortunately, Google is not coming to my rescue on finding further mention for this support. Any pointers would be well received. Big thanks, ian
Re: taking top k values of rdd
To make it efficient in your case you may need to write a bit of custom code to emit the top k per partition and then only send those to the driver. On the driver you can just take the top k of the combined top-k lists from each partition (assuming you have (object, count) for each top-k list). — Sent from Mailbox On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote: my initial approach to taking top k values of a rdd was using a priority-queue monoid. along these lines: rdd.mapPartitions({ items => Iterator.single(new PriorityQueue(...)) }, false).reduce(monoid.plus) this works fine, but looking at the code for reduce it first reduces within a partition (which doesn't help me) and then sends the results to the driver where these again get reduced. this means that for every partition the (potentially very bulky) priorityqueue gets shipped to the driver. my driver is client side, not inside the cluster, and i cannot change this, so this shipping to the driver of all these queues can be expensive. is there a better way to do this? should i try a shuffle first to reduce the partitions to the minimal amount (since the number of queues shipped is equal to the number of partitions)? is there a way to reduce to a single item RDD, so the queues stay inside the cluster and i can retrieve the final result with RDD.first?
Re: taking top k values of rdd
Right. That is unavoidable unless as you say you repartition into 1 partition, which may do the trick. When I say send the top k per partition I don't mean send the pq but the actual values. This may end up being relatively small if k and p are not too big. (I'm not sure how large serialized pq is). — Sent from Mailbox On Sat, Jul 5, 2014 at 10:29 AM, Koert Kuipers ko...@tresata.com wrote: hey nick, you are right. i didnt explain myself well and my code example was wrong... i am keeping a priority-queue with k items per partition (using com.twitter.algebird.mutable.PriorityQueueMonoid.build to limit the sizes of the queues). but this still means i am sending k items per partition to my driver, so k x p, while i only need k. thanks! koert On Sat, Jul 5, 2014 at 1:21 PM, Nick Pentreath nick.pentre...@gmail.com wrote: To make it efficient in your case you may need to do a bit of custom code to emit the top k per partition and then only send those to the driver. On the driver you can just top k the combined top k from each partition (assuming you have (object, count) for each top k list). — Sent from Mailbox https://www.dropbox.com/mailbox On Sat, Jul 5, 2014 at 10:17 AM, Koert Kuipers ko...@tresata.com wrote: my initial approach to taking top k values of a rdd was using a priority-queue monoid. along these lines: rdd.mapPartitions({ items = Iterator.single(new PriorityQueue(...)) }, false).reduce(monoid.plus) this works fine, but looking at the code for reduce it first reduces within a partition (which doesnt help me) and then sends the results to the driver where these again get reduced. this means that for every partition the (potentially very bulky) priorityqueue gets shipped to the driver. my driver is client side, not inside cluster, and i cannot change this, so this shipping to driver of all these queues can be expensive. is there a better way to do this? should i try to a shuffle first to reduce the partitions to the minimal amount (since number of queues shipped is equal to number of partitions)? is was a way to reduce to a single item RDD, so the queues stay inside cluster and i can retrieve the final result with RDD.first?
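A sketch of the approach described above: keep only the k largest (item, count) pairs per partition, ship those small arrays (k values per partition rather than whole priority queues) to the driver, and merge them there. It assumes an RDD of (item, count) pairs; for very large partitions a bounded priority queue, as in the thread, would avoid materializing the partition into an array.

    import org.apache.spark.rdd.RDD

    def topK(counts: RDD[(String, Long)], k: Int): Array[(String, Long)] = {
      // At most k pairs leave each partition, so the driver receives k * numPartitions items.
      val perPartition = counts.mapPartitions { iter =>
        Iterator.single(iter.toArray.sortBy(-_._2).take(k))
      }
      perPartition.collect().flatten.sortBy(-_._2).take(k)
    }

Note that Spark's RDD API also exposes top(k) / takeOrdered(k) (given an Ordering), which implement essentially this pattern of per-partition bounded top-k merged on the driver.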
Re: How to parallelize model fitting with different cross-validation folds?
For linear models the 3rd option is by far most efficient and I suspect what Evan is alluding to. Unfortunately it's not directly possible with the classes in Mllib now so you'll have to roll your own using underlying sgd / bfgs primitives. — Sent from Mailbox On Sat, Jul 5, 2014 at 10:45 AM, Christopher Nguyen c...@adatao.com wrote: Hi sparkuser2345, I'm inferring the problem statement is something like how do I make this complete faster (given my compute resources)? Several comments. First, Spark only allows launching parallel tasks from the driver, not from workers, which is why you're seeing the exception when you try. Whether the latter is a sensible/doable idea is another discussion, but I can appreciate why many people assume this should be possible. Second, on optimization, you may be able to apply Sean's idea about (thread) parallelism at the driver, combined with the knowledge that often these cluster tasks bottleneck while competing for the same resources at the same time (cpu vs disk vs network, etc.) You may be able to achieve some performance optimization by randomizing these timings. This is not unlike GMail randomizing user storage locations around the world for load balancing. Here, you would partition each of your RDDs into a different number of partitions, making some tasks larger than others, and thus some may be in cpu-intensive map while others are shuffling data around the network. This is rather cluster-specific; I'd be interested in what you learn from such an exercise. Third, I find it useful always to consider doing as much as possible in one pass, subject to memory limits, e.g., mapPartitions() vs map(), thus minimizing map/shuffle/reduce boundaries with their context switches and data shuffling. In this case, notice how you're running the training+prediction k times over mostly the same rows, with map/reduce boundaries in between. While the training phase is sealed in this context, you may be able to improve performance by collecting all the k models together, and do a [m x k] predictions all at once which may end up being faster. Finally, as implied from the above, for the very common k-fold cross-validation pattern, the algorithm itself might be written to be smart enough to take both train and test data and do the right thing within itself, thus obviating the need for the user to prepare k data sets and running over them serially, and likely saving a lot of repeated computations in the right internal places. Enjoy, -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Sat, Jul 5, 2014 at 1:50 AM, Sean Owen so...@cloudera.com wrote: If you call .par on data_kfolded it will become a parallel collection in Scala and so the maps will happen in parallel . On Jul 5, 2014 9:35 AM, sparkuser2345 hm.spark.u...@gmail.com wrote: Hi, I am trying to fit a logistic regression model with cross validation in Spark 0.9.0 using SVMWithSGD. 
I have created an array data_kfolded where each element is a pair of RDDs containing the training and test data: (training_data: RDD[org.apache.spark.mllib.regression.LabeledPoint], test_data: RDD[org.apache.spark.mllib.regression.LabeledPoint]) scala> data_kfolded res21: Array[(org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint], org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])] = Array((MappedRDD[9] at map at <console>:24,MappedRDD[7] at map at <console>:23), (MappedRDD[13] at map at <console>:24,MappedRDD[11] at map at <console>:23), (MappedRDD[17] at map at <console>:24,MappedRDD[15] at map at <console>:23)) Everything works fine when using data_kfolded: val validationErrors = data_kfolded.map { datafold => val svmAlg = new SVMWithSGD() val model_reg = svmAlg.run(datafold._1) val labelAndPreds = datafold._2.map { point => val prediction = model_reg.predict(point.features) (point.label, prediction) } val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / datafold._2.count trainErr.toDouble } scala> validationErrors res1: Array[Double] = Array(0.8819836785938481, 0.07082521117608837, 0.29833546734955185) However, I have understood that the models are not fitted in parallel as data_kfolded is not an RDD (although it's an array of pairs of RDDs). When running the same code where data_kfolded has been replaced with sc.parallelize(data_kfolded), I get a null pointer exception from the line where the run method of the SVMWithSGD object is called with the training data. I guess this is somehow related to the fact that RDDs can't be accessed from inside a closure. I fail to understand though why the first version works and the second doesn't. Most importantly, is there a way to fit the models in parallel? I would really appreciate your help. val validationErrors = sc.parallelize(data_kfolded).map { datafold => val svmAlg = new SVMWithSGD() val
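To make Sean's .par suggestion concrete, here is a rough sketch: the folds are trained concurrently from driver-side threads, with each fold launching its own Spark jobs. Nothing is parallelized inside a closure (which is what the sc.parallelize(data_kfolded) attempt did and what caused the NPE). The data_kfolded array and the 100 iterations follow the code above.

    import org.apache.spark.mllib.classification.SVMWithSGD

    val validationErrors = data_kfolded.par.map { case (train, test) =>
      // Driver-side call; the training jobs themselves still run on the cluster.
      val model = SVMWithSGD.train(train, 100)
      val labelAndPreds = test.map(p => (p.label, model.predict(p.features)))
      labelAndPreds.filter(r => r._1 != r._2).count.toDouble / test.count
    }.toArray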
Re: Recommended pipeline automation tool? Oozie?
You may look into the new Azkaban - which while being quite heavyweight is actually quite pleasant to use when set up. You can run spark jobs (spark-submit) using azkaban shell commands and pass paremeters between jobs. It supports dependencies, simple dags and scheduling with retries. I'm digging deeper and it may be worthwhile extending it with a Spark job type... It's probably best for mixed Hadoop / Spark clusters... — Sent from Mailbox On Fri, Jul 11, 2014 at 12:52 AM, Andrei faithlessfri...@gmail.com wrote: I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in presence of Spark. Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can always write wrapper as Java or Shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's Oozie context that is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, FileInputStream that should be closed at last step only) you have to do some advanced kung fu. Luigi, on other hand, has its niche - complicated dataflows with many tasks that depend on each other. Basically, there are tasks (this is where you define computations) and targets (something that can exist - file on disk, entry in ZooKeeper, etc.). You ask Luigi to get some target, and it creates a plan for achieving this. Luigi is really shiny when your workflow fits this model, but one step away and you are in trouble. For example, consider simple pipeline: run MR job and output temporary data, run another MR job and output final data, clean temporary data. You can make target Clean, that depends on target MRJob2 that, in its turn, depends on MRJob1, right? Not so easy. How do you check that Clean task is achieved? If you just test whether temporary directory is empty or not, you catch both cases - when all tasks are done and when they are not even started yet. Luigi allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single run() method, but ruins the entire idea. And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on Spark mailing list :) Experience with these frameworks, however, gave me some insights about typical data pipelines. 1. Pipelines are mostly linear. Oozie, Luigi and number of other frameworks allow branching, but most pipelines actually consist of moving data from source to destination with possibly some transformations in between (I'll be glad if somebody share use cases when you really need branching). 2. Transactional logic is important. Either everything, or nothing. Otherwise it's really easy to get into inconsistent state. 3. Extensibility is important. You never know what will need in a week or two. So eventually I decided that it is much easier to create your own pipeline instead of trying to adopt your code to existing frameworks. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods: * run() - launch this step * fail() - what to do if step fails * finalize() - (optional) what to do when all steps are done For example, if you want to add possibility to run Spark jobs, you just create SparkStep and configure it with required code. 
If you want Hive query - just create HiveStep and configure it with Hive connection settings. I use YAML file to configure steps and Context (basically, Map[String, Any]) to pass variables between them. I also use configurable Reporter available for all steps to report the progress. Hopefully, this will give you some insights about best pipeline for your specific case. On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown p...@mult.ifario.us wrote: We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Thu, Jul 10, 2014 at 10:20 AM, k.tham kevins...@gmail.com wrote: I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some advice/suggestions. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
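A bare-bones sketch of the home-grown pipeline Andrei describes above (steps run in order, each with run/fail/finalize hooks and a shared context map for passing values between them); the names and the error-handling policy here are assumptions for illustration, not his actual code.

    import scala.collection.mutable

    trait Step {
      def run(ctx: mutable.Map[String, Any]): Unit                      // launch this step
      def fail(ctx: mutable.Map[String, Any], e: Throwable): Unit = ()  // what to do if a step fails
      def finalizeStep(ctx: mutable.Map[String, Any]): Unit = ()        // cleanup when all steps are done
    }

    def runPipeline(steps: Seq[Step]): Unit = {
      val ctx = mutable.Map.empty[String, Any]
      try {
        steps.foreach(_.run(ctx))
      } catch {
        case e: Throwable => steps.foreach(_.fail(ctx, e)); throw e
      } finally {
        steps.foreach(_.finalizeStep(ctx))
      }
    }

The "either everything or nothing" transactional behaviour mentioned above would live in the fail/finalize hooks of each step.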
Re: Recommended pipeline automation tool? Oozie?
Did you use old azkaban or azkaban 2.5? It has been completely rewritten. Not saying it is the best but I found it way better than oozie for example. Sent from my iPhone On 11 Jul 2014, at 09:24, 明风 mingf...@taobao.com wrote: We use Azkaban for a short time and suffer a lot. Finally we almost rewrite it totally. Don’t recommend it really. 发件人: Nick Pentreath nick.pentre...@gmail.com 答复: user@spark.apache.org 日期: 2014年7月11日 星期五 下午3:18 至: user@spark.apache.org 主题: Re: Recommended pipeline automation tool? Oozie? You may look into the new Azkaban - which while being quite heavyweight is actually quite pleasant to use when set up. You can run spark jobs (spark-submit) using azkaban shell commands and pass paremeters between jobs. It supports dependencies, simple dags and scheduling with retries. I'm digging deeper and it may be worthwhile extending it with a Spark job type... It's probably best for mixed Hadoop / Spark clusters... — Sent from Mailbox On Fri, Jul 11, 2014 at 12:52 AM, Andrei faithlessfri...@gmail.com wrote: I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in presence of Spark. Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can always write wrapper as Java or Shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's Oozie context that is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, FileInputStream that should be closed at last step only) you have to do some advanced kung fu. Luigi, on other hand, has its niche - complicated dataflows with many tasks that depend on each other. Basically, there are tasks (this is where you define computations) and targets (something that can exist - file on disk, entry in ZooKeeper, etc.). You ask Luigi to get some target, and it creates a plan for achieving this. Luigi is really shiny when your workflow fits this model, but one step away and you are in trouble. For example, consider simple pipeline: run MR job and output temporary data, run another MR job and output final data, clean temporary data. You can make target Clean, that depends on target MRJob2 that, in its turn, depends on MRJob1, right? Not so easy. How do you check that Clean task is achieved? If you just test whether temporary directory is empty or not, you catch both cases - when all tasks are done and when they are not even started yet. Luigi allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single run() method, but ruins the entire idea. And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on Spark mailing list :) Experience with these frameworks, however, gave me some insights about typical data pipelines. 1. Pipelines are mostly linear. Oozie, Luigi and number of other frameworks allow branching, but most pipelines actually consist of moving data from source to destination with possibly some transformations in between (I'll be glad if somebody share use cases when you really need branching). 2. Transactional logic is important. Either everything, or nothing. Otherwise it's really easy to get into inconsistent state. 3. Extensibility is important. You never know what will need in a week or two. 
So eventually I decided that it is much easier to create your own pipeline instead of trying to adopt your code to existing frameworks. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods: * run() - launch this step * fail() - what to do if step fails * finalize() - (optional) what to do when all steps are done For example, if you want to add possibility to run Spark jobs, you just create SparkStep and configure it with required code. If you want Hive query - just create HiveStep and configure it with Hive connection settings. I use YAML file to configure steps and Context (basically, Map[String, Any]) to pass variables between them. I also use configurable Reporter available for all steps to report the progress. Hopefully, this will give you some insights about best pipeline for your specific case. On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown p...@mult.ifario.us wrote: We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Thu, Jul 10, 2014 at 10:20 AM, k.tham
Re: import org.apache.spark.streaming.twitter._ in Shell
You could try the following: create a minimal project using sbt or Maven, add spark-streaming-twitter as a dependency, run sbt assembly (or mvn package) on that to create a fat jar (with Spark as provided dependency), and add that to the shell classpath when starting up. On Tue, Jul 15, 2014 at 9:06 AM, Praveen Seluka psel...@qubole.com wrote: If you want to make Twitter* classes available in your shell, I believe you could do the following 1. Change the parent pom module ordering - Move external/twitter before assembly 2. In assembly/pom.xm, add external/twitter dependency - this will package twitter* into the assembly jar Now when spark-shell is launched, assembly jar is in classpath - hence twitter* too. I think this will work (remember trying this sometime back) On Tue, Jul 15, 2014 at 11:59 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Hmm, I'd like to clarify something from your comments, Tathagata. Going forward, is Twitter Streaming functionality not supported from the shell? What should users do if they'd like to process live Tweets from the shell? Nick On Mon, Jul 14, 2014 at 11:50 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: At some point, you were able to access TwitterUtils from spark shell using Spark 1.0.0+ ? Yep. If yes, then what change in Spark caused it to not work any more? It still works for me. I was just commenting on your remark that it doesn't work through the shell, which I now understand to apply to versions of Spark before 1.0.0. Nick
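A minimal wrapper project along the lines suggested above might look like this; the versions are examples only (match them to your cluster), and producing the fat jar needs the sbt-assembly plugin.

    // build.sbt
    name := "spark-shell-twitter-deps"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"              % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"         % "1.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-twitter" % "1.0.1"
    )

After sbt assembly, start the shell with --jars pointing at the generated assembly jar and the TwitterUtils imports should resolve.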
Re: Count distinct with groupBy usage
You can use .distinct.count on your user RDD. What are you trying to achieve with the time group by? — Sent from Mailbox On Tue, Jul 15, 2014 at 8:14 PM, buntu buntu...@gmail.com wrote: Hi -- New to Spark and trying to figure out how to generate unique counts per page by date given this raw data: timestamp,page,userId 1405377264,google,user1 1405378589,google,user2 1405380012,yahoo,user1 .. I can do a groupBy on a field and get the count: val lines = sc.textFile("data.csv") val csv = lines.map(_.split(",")) // group by page csv.groupBy(_(1)).count But I'm not able to see how to do a distinct count on userId and also apply another groupBy on the timestamp field. Please let me know how to handle such cases. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
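For the concrete question above (unique users per page per day), one way that avoids groupBy entirely is to build a (day, page) key, drop duplicate users with distinct, and reduce. The date formatting assumes the timestamps are epoch seconds as in the sample rows; if the file has a header row, filter it out first.

    import org.apache.spark.SparkContext._   // for reduceByKey outside the shell

    val lines = sc.textFile("data.csv")

    val uniquesPerDayAndPage = lines
      .map(_.split(","))
      .map { case Array(ts, page, user) =>
        val day = new java.text.SimpleDateFormat("yyyy-MM-dd")
          .format(new java.util.Date(ts.toLong * 1000L))
        ((day, page), user)
      }
      .distinct()                          // one record per (day, page, user)
      .map { case (key, _) => (key, 1L) }
      .reduceByKey(_ + _)                  // distinct user count per (day, page)

    uniquesPerDayAndPage.take(10).foreach(println)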
Re: Large scale ranked recommendation
It is very true that making predictions in batch for all 1 million users against the 10k items will be quite onerous in terms of computation. I have run into this issue too in making batch predictions. Some ideas: 1. Do you really need to generate recommendations for each user in batch? How are you serving these recommendations? In most cases, you only need to make recs when a user is actively interacting with your service or product etc. Doing it all in batch tends to be a big waste of computation resources. In our system for example we are serving them in real time (as a user arrives at a web page, say, our customer hits our API for recs), so we only generate the rec at that time. You can take a look at Oryx for this ( https://github.com/cloudera/oryx) though it does not yet support Spark, you may be able to save the model into the correct format in HDFS and have Oryx read the data. 2. If you do need to make the recs in batch, then I would suggest: (a) because you have few items, I would collect the item vectors and form a matrix. (b) broadcast that matrix (c) do a mapPartitions on the user vectors. Form a user matrix from the vectors in each partition (maybe create quite a few partitions to make each user matrix not too big) (d) do a value call on the broadcasted item matrix (e) now for each partition you have the (small) item matrix and a (larger) user matrix. Do a matrix multiply and you end up with a (U x I) matrix with the scores for each user in the partition. Because you are using BLAS here, it will be significantly faster than individually computed dot products (f) sort the scores for each user and take top K (g) save or collect and do whatever with the scores 3. in conjunction with (2) you can try throwing more resources at the problem too If you access the underlying Breeze vectors (I think the toBreeze method is private so you may have to re-implement it), you can do all this using Breeze (e.g. concatenating vectors to make matrices, iterating and whatnot). Hope that helps Nick On Fri, Jul 18, 2014 at 1:17 AM, m3.sharma sharm...@umn.edu wrote: Yes, thats what prediction should be doing, taking dot products or sigmoid function for each user,item pair. For 1 million users and 10 K items data there are 10 billion pairs. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10107.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
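A rough sketch of steps (a)-(f) above, using Breeze directly. The variable names and shapes are assumptions: itemFactors is the collected productFeatures of an MLlib ALS model (Array[(Int, Array[Double])]), userFeatures is the RDD of user factors, and for clarity it scores one user per multiply; in practice you would stack many users per partition into one matrix to get the full BLAS benefit.

    import breeze.linalg._

    val k = itemFactors.head._2.length
    val itemIds = itemFactors.map(_._1)
    // k x numItems matrix, one item vector per column (Breeze is column-major).
    val itemMat = new DenseMatrix(k, itemFactors.length, itemFactors.flatMap(_._2))
    val bcItemMat = sc.broadcast(itemMat)
    val bcItemIds = sc.broadcast(itemIds)

    val topRecs = userFeatures.mapPartitions { users =>
      val im = bcItemMat.value
      users.map { case (userId, uf) =>
        val scores = (new DenseMatrix(1, k, uf) * im).data   // 1 x numItems row of scores
        val top = bcItemIds.value.zip(scores).sortBy(-_._2).take(10)
        (userId, top)
      }
    }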
Re: Large scale ranked recommendation
Agree GPUs may be interesting for this kind of massively parallel linear algebra on reasonable size vectors. These projects might be of interest in this regard: https://github.com/BIDData/BIDMach https://github.com/BIDData/BIDMat https://github.com/dlwh/gust Nick On Fri, Jul 18, 2014 at 7:40 PM, m3.sharma sharm...@umn.edu wrote: Thanks Nick real-time suggestion is good, will see if we can add that to our deployment strategy and you are correct we may not need recommendation for each user. Will try adding more resources and broadcasting item features suggestion as currently they don't seem to be huge. As users and items both will continue to grow in future for faster vector computations I think few GPU nodes will suffice to serve faster recommendation after learning model with SPARK. It will be great to have builtin GPU support in SPARK for faster computations to leverage GPU capability of nodes for performing these flops faster. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10183.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NullPointerException When Reading Avro Sequence Files
I got this working locally a little while ago when playing around with AvroKeyInputFile: https://gist.github.com/MLnick/5864741781b9340cb211 But not sure about AvroSequenceFile. Any chance you have an example datafile or records? On Sat, Jul 19, 2014 at 11:00 AM, Sparky gullo_tho...@bah.com wrote: To be more specific, I'm working with a system that stores data in org.apache.avro.hadoop.io.AvroSequenceFile format. An AvroSequenceFile is A wrapper around a Hadoop SequenceFile that also supports reading and writing Avro data. It seems that Spark does not support this out of the box. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
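For plain Avro data files, the AvroKeyInputFormat route from the gist linked above looks roughly like the sketch below (avro-mapred classes, field name and path are placeholders). Whether it helps for AvroSequenceFile specifically depends on using that wrapper's own input format instead, so treat this only as the plain-Avro case.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    val avroRdd = sc.newAPIHadoopFile(
      "hdfs:///path/to/data.avro",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      sc.hadoopConfiguration)

    // Each record is (AvroKey[GenericRecord], NullWritable); pull a field out by name.
    avroRdd.map { case (key, _) => key.datum().get("someField") }.take(5).foreach(println)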
Re: Spark clustered client
At the moment your best bet for sharing SparkContexts across jobs will be Ooyala job server: https://github.com/ooyala/spark-jobserver It doesn't yet support spark 1.0 though I did manage to amend it to get it to build and run on 1.0 — Sent from Mailbox On Wed, Jul 23, 2014 at 1:21 AM, Asaf Lahav asaf.la...@gmail.com wrote: Hi Folks, I have been trying to dig up some information in regards to what are the possibilities when wanting to deploy more than one client process that consumes Spark. Let's say I have a Spark Cluster of 10 servers, and would like to setup 2 additional servers which are sending requests to it through a Spark context, referencing one specific file of 1TB of data. Each client process, has its own SparkContext instance. Currently, the result is that that same file is loaded into memory twice because the Spark Context resources are not shared between processes/jvms. I wouldn't like to have that same file loaded over and over again with every new client being introduced. What would be the best practice here? Am I missing something? Thank you, Asaf
Re: Workarounds for accessing sequence file data via PySpark?
Load from sequenceFile for PySpark is in master and save is in this PR underway (https://github.com/apache/spark/pull/1338) I hope that Kan will have it ready to merge in time for 1.1 release window (it should be, the PR just needs a final review or two). In the meantime you can check out master and test out the sequenceFile load support in PySpark (there are examples in the /examples project and in python test, and some documentation in /docs) On Wed, Jul 23, 2014 at 4:42 PM, Gary Malouf malouf.g...@gmail.com wrote: I am aware that today PySpark can not load sequence files directly. Are there work-arounds people are using (short of duplicating all the data to text files) for accessing this data?
Re: iScala or Scala-notebook
IScala itself seems to be a bit dead unfortunately. I did come across this today: https://github.com/tribbloid/ISpark On Fri, Jul 18, 2014 at 4:59 AM, ericjohnston1989 ericjohnston1...@gmail.com wrote: Hey everyone, I know this was asked before but I'm wondering if there have since been any updates. Are there any plans to integrate iScala/Scala-notebook with spark in the near future? This seems like something a lot of people would find very useful, so I was just wondering if anyone has started working on it. Thanks, Eric -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/iScala-or-Scala-notebook-tp10127.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: zip two RDD in pyspark
parallelize uses the default Serializer (PickleSerializer) while textFile uses UTF8Serializer. You can get around this with index.zip(input_data._reserialize()) (or index.zip(input_data.map(lambda x: x))) (But if you try to just do this, you run into the issue with different number of partitions): index.zip(input_data._reserialize()).count() Py4JJavaError: An error occurred while calling o60.collect. : java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedRDD.getPartitions(ZippedRDD.scala:55) On Wed, Jul 30, 2014 at 7:53 AM, Davies Liu dav...@databricks.com wrote: On Mon, Jul 28, 2014 at 12:58 PM, l lishu...@gmail.com wrote: I have a file in s3 that I want to map each line with an index. Here is my code: input_data = sc.textFile('s3n:/myinput',minPartitions=6).cache() N input_data.count() index = sc.parallelize(range(N), 6) index.zip(input_data).collect() I think you can not do zipWithIndex() in this way, because the number of lines in each partition of input_data will be different than index. You need get the exact number of lines for each partitions first, then generate correct index. It will be easy to do with mapPartitions() nums = input_data.mapPartitions(lambda it: [sum(1 for i in it)]).collect() starts = [sum(nums[:i]) for i in range(len(nums))] zipped = input_data.mapPartitionsWithIndex(lambda i,it: ((starts[i]+j, x) for j,x in enumerate(it))) ... 14/07/28 19:49:31 INFO DAGScheduler: Completed ResultTask(18, 4) 14/07/28 19:49:31 INFO DAGScheduler: Stage 18 (collect at stdin:1) finished in 0.031 s 14/07/28 19:49:31 INFO SparkContext: Job finished: collect at stdin:1, took 0.03707 s Traceback (most recent call last): File stdin, line 1, in module File /root/spark/python/pyspark/rdd.py, line 584, in collect return list(self._collect_iterator_through_file(bytesInJava)) File /root/spark/python/pyspark/rdd.py, line 592, in _collect_iterator_through_file self.ctx._writeToFile(iterator, tempFile.name) File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. 
: java.lang.ClassCastException: java.lang.String cannot be cast to [B at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$3.apply(PythonRDD.scala:312) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$3.apply(PythonRDD.scala:309) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:309) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:342) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:337) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) As I see it, the job is completed, but I don't understand what's happening to 'String cannot be cast to [B'. I tried to zip two parallelCollectionRDD and it works fine. But here I have a MappedRDD at textFile. Not sure what's going on here. Could you provide an script and dataset to reproduce this error? Maybe there are some corner cases during serialization. Also, why Python does not have ZipWithIndex()? The features in PySpark are much less than Spark, hopefully it will catch up in next two releases. Thanks for any help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-two-RDD-in-pyspark-tp10806.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit
I'm also getting this - Ryan we both seem to be running into this issue with elasticsearch-hadoop :) I tried spark.files.userClassPathFirst true on command line and that doesn;t work If I put it that line in spark/conf/spark-defaults it works but now I'm getting: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputFormat think I may need to add hadoop-client to my assembly, but any other ideas welcome. Ryan, will let you know how I get on On Mon, Aug 4, 2014 at 10:28 AM, Sean Owen so...@cloudera.com wrote: I'm guessing you have the Jackson classes in your assembly but so does Spark. Its classloader wins, and does not contain the class present in your app's version of Jackson. Try spark.files.userClassPathFirst ? On Mon, Aug 4, 2014 at 6:28 AM, Ryan Braley r...@traintracks.io wrote: Hi Folks, I have an assembly jar that I am submitting using spark-submit script on a cluster I created with the spark-ec2 script. I keep running into the java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass error on my workers even though jar tf clearly shows that class being a part of my assembly jar. I have the spark program working locally. Here is the error log: https://gist.github.com/rbraley/cf5cd3457a89b1c0ac88 Anybody have any suggestions of things I can try? It seems http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1403899110.65393.yahoomail...@web160503.mail.bf1.yahoo.com%3E that this is a similar error. I am open to recompiling spark to fix this, but I would like to run my job on my cluster rather than just locally. Thanks, Ryan Ryan Braley | Founder http://traintracks.io/ US: +1 (206) 866 5661 CN: +86 156 1153 7598 Coding the future. Decoding the game. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit
By the way, for anyone using elasticsearch-hadoop, there is a fix for this here: https://github.com/elasticsearch/elasticsearch-hadoop/issues/239 Ryan - using the nightly snapshot build of 2.1.0.BUILD-SNAPSHOT fixed this for me. On Thu, Aug 7, 2014 at 3:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: I'm also getting this - Ryan we both seem to be running into this issue with elasticsearch-hadoop :) I tried spark.files.userClassPathFirst true on command line and that doesn;t work If I put it that line in spark/conf/spark-defaults it works but now I'm getting: java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputFormat think I may need to add hadoop-client to my assembly, but any other ideas welcome. Ryan, will let you know how I get on On Mon, Aug 4, 2014 at 10:28 AM, Sean Owen so...@cloudera.com wrote: I'm guessing you have the Jackson classes in your assembly but so does Spark. Its classloader wins, and does not contain the class present in your app's version of Jackson. Try spark.files.userClassPathFirst ? On Mon, Aug 4, 2014 at 6:28 AM, Ryan Braley r...@traintracks.io wrote: Hi Folks, I have an assembly jar that I am submitting using spark-submit script on a cluster I created with the spark-ec2 script. I keep running into the java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass error on my workers even though jar tf clearly shows that class being a part of my assembly jar. I have the spark program working locally. Here is the error log: https://gist.github.com/rbraley/cf5cd3457a89b1c0ac88 Anybody have any suggestions of things I can try? It seems http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3c1403899110.65393.yahoomail...@web160503.mail.bf1.yahoo.com%3E that this is a similar error. I am open to recompiling spark to fix this, but I would like to run my job on my cluster rather than just locally. Thanks, Ryan Ryan Braley | Founder http://traintracks.io/ US: +1 (206) 866 5661 CN: +86 156 1153 7598 Coding the future. Decoding the game. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Failed running Spark ALS
Have you set spark.local.dir (I think this is the config setting)? It needs to point to a volume with plenty of space. By default if I recall it point to /tmp Sent from my iPhone On 19 Sep 2014, at 23:35, jw.cmu jinliangw...@gmail.com wrote: I'm trying to run Spark ALS using the netflix dataset but failed due to No space on device exception. It seems the exception is thrown after the training phase. It's not clear to me what is being written and where is the output directory. I was able to run the same code on the provided test.data dataset. I'm new to Spark and I'd like to get some hints for resolving this problem. The code I ran was got from https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html (the Java version). Relevant info: Spark version: 1.0.2 (Standalone deployment) # slaves/workers/exectuors: 8 Core per worker: 64 memory per executor: 100g Application parameters are left as default. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Failed-running-Spark-ALS-tp14704.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
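For reference, pointing the shuffle/spill directory at a larger volume can be done on the SparkConf (or in spark-defaults.conf, or via the SPARK_LOCAL_DIRS environment variable); the path below is just an example.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ALS")
      .set("spark.local.dir", "/mnt/spark-local")  // example path on a volume with enough space

    val sc = new SparkContext(conf)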
Re: spark 1.1.0 - hbase 0.98.6-hadoop2 version - py4j.protocol.Py4JJavaError java.lang.ClassNotFoundException
forgot to copy user list On Sat, Oct 4, 2014 at 3:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote: what version did you put in the pom.xml? it does seem to be in Maven central: http://search.maven.org/#artifactdetails%7Corg.apache.hbase%7Chbase%7C0.98.6-hadoop2%7Cpom <dependency> <groupId>org.apache.hbase</groupId> <artifactId>hbase</artifactId> <version>0.98.6-hadoop2</version> </dependency> Note you shouldn't need to rebuild Spark, I think just the example project via sbt examples/assembly On Fri, Oct 3, 2014 at 10:55 AM, serkan.dogan foreignerdr...@yahoo.com wrote: Hi, I installed hbase-0.98.6-hadoop2. It's working not any problem with that. When i am try to run spark hbase python examples, (wordcount examples working - not python issue) ./bin/spark-submit --master local --driver-class-path ./examples/target/spark-examples_2.10-1.1.0.jar ./examples/src/main/python/hbase_inputformat.py localhost myhbasetable the process exit with ClassNotFoundException... I search lots of blogs, sites all says spark 1.1 version built with hbase 0.94.6 rebuild with own hbase version. I try first, change hbase version number - in pom.xml -- nothing found maven central I try second, compile hbase from src and copy hbase/lib folder hbase jars to spark/lib_managed folder and edit spark-defaults.conf my spark-defaults.conf spark.executor.extraClassPath /home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-server-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-protocol-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-hadoop2-compat-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-client-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/hbase-commont-0.98.6-hadoop2.jar:/home/downloads/spark/spark-1.1.0/lib_managed/jars/htrace-core-2.04.jar My question is how i can work with hbase 0.98.6-hadoop2 with spark 1.1.0 Here is the exception message Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/10/03 11:27:15 WARN Utils: Your hostname, xxx.yyy.com resolves to a loopback address: 127.0.0.1; using 1.1.1.1 instead (on interface eth0) 14/10/03 11:27:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/10/03 11:27:15 INFO SecurityManager: Changing view acls to: root, 14/10/03 11:27:15 INFO SecurityManager: Changing modify acls to: root, 14/10/03 11:27:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, ); users with modify permissions: Set(root, ) 14/10/03 11:27:16 INFO Slf4jLogger: Slf4jLogger started 14/10/03 11:27:16 INFO Remoting: Starting remoting 14/10/03 11:27:16 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdri...@1-1-1-1-1.rev.mydomain.io:49256] 14/10/03 11:27:16 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkdri...@1-1-1-1-1.rev.mydomain.io:49256] 14/10/03 11:27:16 INFO Utils: Successfully started service 'sparkDriver' on port 49256. 14/10/03 11:27:16 INFO SparkEnv: Registering MapOutputTracker 14/10/03 11:27:16 INFO SparkEnv: Registering BlockManagerMaster 14/10/03 11:27:16 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141003112716-298d 14/10/03 11:27:16 INFO Utils: Successfully started service 'Connection manager for block manager' on port 35106.
14/10/03 11:27:16 INFO ConnectionManager: Bound socket to port 35106 with id = ConnectionManagerId(1-1-1-1-1.rev.mydomain.io,35106) 14/10/03 11:27:16 INFO MemoryStore: MemoryStore started with capacity 267.3 MB 14/10/03 11:27:16 INFO BlockManagerMaster: Trying to register BlockManager 14/10/03 11:27:16 INFO BlockManagerMasterActor: Registering block manager 1-1-1-1-1.rev.mydomain.io:35106 with 267.3 MB RAM 14/10/03 11:27:16 INFO BlockManagerMaster: Registered BlockManager 14/10/03 11:27:16 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f60b0533-998f-4af2-a208-d04c571eab82 14/10/03 11:27:16 INFO HttpServer: Starting HTTP Server 14/10/03 11:27:16 INFO Utils: Successfully started service 'HTTP file server' on port 49611. 14/10/03 11:27:16 INFO Utils: Successfully started service 'SparkUI' on port 4040. 14/10/03 11:27:16 INFO SparkUI: Started SparkUI at http://1-1-1-1-1.rev.mydomain.io:4040 14/10/03 11:27:16 INFO Utils: Copying /home/downloads/spark/spark-1.1.0/./examples/src/main/python/hbase_inputformat.py to /tmp/spark-7232227a-0547-454e-9f68-805fa7b0c2f0/hbase_inputformat.py 14/10/03 11:27:16 INFO SparkContext: Added file file:/home/downloads/spark/spark-1.1.0/./examples/src/main/python/hbase_inputformat.py at http://1.1.1.1:49611/files/hbase_inputformat.py with timestamp 1412324836837 14/10/03 11:27:16 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp:// sparkdri...@1-1-1-1-1.rev.mydomain.io:49256/user/HeartbeatReceiver Traceback (most
Re: word2vec: how to save an mllib model and reload it?
Currently I see the word2vec model is collected onto the master, so the model itself is not distributed. I guess the question is why do you need a distributed model? Is the vocab size so large that it's necessary? For model serving in general, unless the model is truly massive (ie cannot fit into memory on a modern high end box with 64, or 128GB ram) then single instance is way faster and simpler (using a cluster of machines is more for load balancing / fault tolerance). What is your use case for model serving? — Sent from Mailbox On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated. On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote: There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with matlab. In python either pickling things or dumping to json seems pretty common. (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent java serialization to me.. Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add? On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote: that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Plain old java serialization is one straightforward approach if you're in java/scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? specifically, i'm using the mllib word2vec model... thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
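A minimal sketch of the plain Java serialization route mentioned in this thread, for a model object that lives on the driver (as the collected Word2Vec model does). It assumes the model class is serializable and says nothing about compatibility across Spark or Scala versions.

    import java.io._

    def saveModel[M <: Serializable](model: M, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    def loadModel[M](path: String): M = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[M] finally in.close()
    }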
Re: word2vec: how to save an mllib model and reload it?
For ALS if you want real time recs (and usually this is order 10s to a few 100s ms response), then Spark is not the way to go - a serving layer like Oryx, or prediction.io is what you want. (At graphflow we've built our own). You hold the factor matrices in memory and do the dot product in real time (with optional caching). Again, even for huge models (10s of millions users/items) this can be handled on a single, powerful instance. The issue at this scale is winnowing down the search space using LSH or similar approach to get to real time speeds. For word2vec it's pretty much the same thing as what you have is very similar to one of the ALS factor matrices. One problem is you can't access the wors2vec vectors as they are private val. I think this should be changed actually, so that just the word vectors could be saved and used in a serving layer. — Sent from Mailbox On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks evan.spa...@gmail.com wrote: There are a few examples where this is the case. Let's take ALS, where the result is a MatrixFactorizationModel, which is assumed to be big - the model consists of two matrices, one (users x k) and one (k x products). These are represented as RDDs. You can save these RDDs out to disk by doing something like model.userFeatures.saveAsObjectFile(...) and model.productFeatures.saveAsObjectFile(...) to save out to HDFS or Tachyon or S3. Then, when you want to reload you'd have to instantiate them into a class of MatrixFactorizationModel. That class is package private to MLlib right now, so you'd need to copy the logic over to a new class, but that's the basic idea. That said - using spark to serve these recommendations on a point-by-point basis might not be optimal. There's some work going on in the AMPLab to address this issue. On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh duy.huynh@gmail.com wrote: you're right, serialization works. what is your suggestion on saving a distributed model? so part of the model is in one cluster, and some other parts of the model are in other clusters. during runtime, these sub-models run independently in their own clusters (load, train, save). and at some point during run time these sub-models merge into the master model, which also loads, trains, and saves at the master level. much appreciated. On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks evan.spa...@gmail.com wrote: There's some work going on to support PMML - https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been merged into master. What are you used to doing in other environments? In R I'm used to running save(), same with matlab. In python either pickling things or dumping to json seems pretty common. (even the scikit-learn docs recommend pickling - http://scikit-learn.org/stable/modules/model_persistence.html). These all seem basically equivalent java serialization to me.. Would some helper functions (in, say, mllib.util.modelpersistence or something) make sense to add? On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh duy.huynh@gmail.com wrote: that works. is there a better way in spark? this seems like the most common feature for any machine learning work - to be able to save your model after training it and load it later. On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks evan.spa...@gmail.com wrote: Plain old java serialization is one straightforward approach if you're in java/scala. On Thu, Nov 6, 2014 at 11:26 PM, ll duy.huynh@gmail.com wrote: what is the best way to save an mllib model that you just trained and reload it in the future? 
specifically, i'm using the mllib word2vec model... thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
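A hedged sketch of the saveAsObjectFile approach Evan describes for a MatrixFactorizationModel; the paths are placeholders and ratings, rank, iterations, lambda and sc are assumed to be in scope. Note the MatrixFactorizationModel constructor is package-private in MLlib at this point, so the reload step would need to live in (or copy logic from) the org.apache.spark.mllib.recommendation package, as mentioned above:

  import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

  val model = ALS.train(ratings, rank, iterations, lambda)

  // Persist the two factor RDDs (HDFS, S3 or Tachyon paths all work)
  model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
  model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

  // Later: reload the factors and rebuild the model
  val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
  val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
  val reloaded = new MatrixFactorizationModel(rank, userFeatures, productFeatures)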
Re: pyspark get column family and qualifier names from hbase table
Feel free to add that converter as an option in the Spark examples via a PR :) — Sent from Mailbox On Wed, Nov 12, 2014 at 3:27 AM, alaa contact.a...@gmail.com wrote: Hey freedafeng, I'm exactly where you are. I want the output to show the rowkey and all column qualifiers that correspond to it. How did you write HBaseResultToStringConverter to do what you wanted it to do? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-get-column-family-and-qualifier-names-from-hbase-table-tp18613p18650.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: RMSE in MovieLensALS increases or stays stable as iterations increase.
copying user group - I keep replying directly vs reply all :) On Wed, Nov 26, 2014 at 2:03 PM, Nick Pentreath nick.pentre...@gmail.com wrote: ALS will be guaranteed to decrease the squared error (therefore RMSE) in each iteration, on the *training* set. This does not hold for the *test* set / cross validation. You would expect the test set RMSE to stabilise as iterations increase, since the algorithm converges - but not necessarily to decrease. On Wed, Nov 26, 2014 at 1:57 PM, Kostas Kloudas kklou...@gmail.com wrote: Hi all, I am getting familiarized with Mllib and a thing I noticed is that running the MovieLensALS example on the movieLens dataset for increasing number of iterations does not decrease the rmse. The results for 0.6% training set and 0.4% test are below. For training set to 0.8%, the results are almost identical. Shouldn’t it be normal to see a decreasing error? Especially going from 1 to 5 iterations. Running 1 iterations Test RMSE for 1 iter. = 1.2452964343277886 (52.75712592704 s). Running 5 iterations Test RMSE for 5 iter. = 1.3258973764470259 (61.183927666 s). Running 9 iterations Test RMSE for 9 iter. = 1.3260308117704385 (61.8494887581 s). Running 13 iterations Test RMSE for 13 iter. = 1.3260310099809915 (73.799510125 s). Running 17 iterations Test RMSE for 17 iter. = 1.3260310102735398 (77.5651218531 s). Running 21 iterations Test RMSE for 21 iter. = 1.3260310102739703 (79.607495074 s). Running 25 iterations Test RMSE for 25 iter. = 1.326031010273971 (88.631776301 s). Running 29 iterations Test RMSE for 29 iter. = 1.3260310102739712 (101.178383079 s). Thanks a lot, Kostas - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
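For reference, a rough sketch of how the test RMSE in the numbers above is typically computed; training and test are assumed to be RDD[Rating] from a random split of the MovieLens data, and the regularization value is arbitrary. The training RMSE is what ALS is guaranteed to reduce each iteration, while this test-set figure is only expected to stabilise:

  import org.apache.spark.mllib.recommendation.{ALS, Rating}

  def testRmse(rank: Int, numIterations: Int): Double = {
    val model = ALS.train(training, rank, numIterations, 0.01)
    // predict ratings for the (user, product) pairs in the test set
    val predictions = model.predict(test.map(r => (r.user, r.product)))
                           .map(p => ((p.user, p.product), p.rating))
    val ratingsAndPreds = test.map(r => ((r.user, r.product), r.rating)).join(predictions)
    math.sqrt(ratingsAndPreds.map { case (_, (actual, predicted)) =>
      (actual - predicted) * (actual - predicted)
    }.mean())
  }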
Re: locality sensitive hashing for spark
Looks interesting thanks for sharing. Does it support cosine similarity ? I only saw jaccard mentioned from a quick glance. — Sent from Mailbox On Mon, Dec 22, 2014 at 4:12 AM, morr0723 michael.d@gmail.com wrote: I've pushed out an implementation of locality sensitive hashing for spark. LSH has a number of use cases, most prominent being if the features are not based in Euclidean space. Code, documentation, and small exemplar dataset is available on github: https://github.com/mrsqueeze/spark-hash Feel free to pass along any comments or issues. Enjoy! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/locality-sensitive-hashing-for-spark-tp20803.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SaveAsTextFile to S3 bucket
Your output folder specifies rdd.saveAsTextFile("s3n://nexgen-software/dev/output"); So it will try to write to /dev/output which is as expected. If you create the directory /dev/output upfront in your bucket, and try to save it to that (empty) directory, what is the behaviour? On Tue, Jan 27, 2015 at 6:21 AM, Chen, Kevin kevin.c...@neustar.biz wrote: Does anyone know if I can save an RDD as a text file to a pre-created directory in an S3 bucket? I have a directory created in an S3 bucket: //nexgen-software/dev When I tried to save an RDD as a text file in this directory: rdd.saveAsTextFile("s3n://nexgen-software/dev/output"); I got the following exception at runtime: Exception in thread main org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/dev' - ResponseCode=403, ResponseMessage=Forbidden I have verified /dev has write permission. However, if I grant the bucket //nexgen-software write permission, I don't get the exception. But the output is not created under dev. Rather, a different /dev/output directory is created in the bucket (//nexgen-software). Is this how saveAsTextFile behaves in S3? Is there any way I can have the output created under a pre-defined directory? Thanks in advance.
Re: Is it possible to do incremental training using ALSModel (MLlib)?
As I recall Oryx (the old version, and I assume the new one too) provide something like this: http://cloudera.github.io/oryx/apidocs/com/cloudera/oryx/als/common/OryxRecommender.html#recommendToAnonymous-java.lang.String:A-float:A-int- though Sean will be more on top of that than me :) On Mon, Jan 5, 2015 at 2:17 PM, Wouter Samaey wouter.sam...@storefront.be wrote: One other idea was that I don’t need to re-train the model, but simply pass all the current user’s recent ratings (including one’s created after the training) to the existing model… Is this a valid option? Wouter Samaey Zaakvoerder Storefront BVBA Tel: +32 472 72 83 07 Web: http://storefront.be LinkedIn: http://www.linkedin.com/in/woutersamaey On 05 Jan 2015, at 13:13, Sean Owen so...@cloudera.com wrote: In the first instance, I'm suggesting that ALS in Spark could perhaps expose a run() method that accepts a previous MatrixFactorizationModel, and uses the product factors from it as the initial state instead. If anybody seconds that idea, I'll make a PR. The second idea is just fold-in: http://www.slideshare.net/srowen/big-practical-recommendations-with-alternating-least-squares/14 Whether you do this or something like SGD, inside or outside Spark, depends on your requirements I think. On Sat, Jan 3, 2015 at 12:04 PM, Wouter Samaey wouter.sam...@storefront.be wrote: Do you know a place where I could find a sample or tutorial for this? I'm still very new at this. And struggling a bit... Thanks in advance Wouter Sent from my iPhone. On 03 Jan 2015, at 10:36, Sean Owen so...@cloudera.com wrote: Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too, which you can find online. This is not quite the same idea as streaming SGD. I'm not sure this fits the RDD model well since it entails updating one element at a time but mini batch could be reasonable. On Jan 3, 2015 5:29 AM, Peng Cheng rhw...@gmail.com wrote: I was under the impression that ALS wasn't designed for it :- The famous ebay online recommender uses SGD However, you can try using the previous model as starting point, and gradually reduce the number of iteration after the model stablize. I never verify this idea, so you need to at least cross-validate it before putting into productio On 2 January 2015 at 04:40, Wouter Samaey wouter.sam...@storefront.be wrote: Hi all, I'm curious about MLlib and if it is possible to do incremental training on the ALSModel. Usually training is run first, and then you can query. But in my case, data is collected in real-time and I want the predictions of my ALSModel to consider the latest data without complete re-training phase. I've checked out these resources, but could not find any info on how to solve this: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html My question fits in a larger picture where I'm using Prediction IO, and this in turn is based on Spark. Thanks in advance for any advice! Wouter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-do-incremental-training-using-ALSModel-MLlib-tp20942.html Sent from the Apache Spark User List mailing list archive at Nabble.com. 
- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
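A small sketch of the "fold-in" idea Sean mentions, under the assumption that the item factors from the last batch ALS run have been collected to the driver as a Map and that the user's recent ratings are available. It solves the usual regularized least-squares step for a single user against the fixed item factors; Breeze is used for the linear algebra, and the names and lambda value are illustrative:

  import breeze.linalg.{DenseMatrix, DenseVector}

  // itemFactors: itemId -> factor vector from the existing model
  // newRatings: (itemId, rating) pairs seen for this user since the last batch rebuild
  def foldInUser(newRatings: Seq[(Int, Double)],
                 itemFactors: Map[Int, Array[Double]],
                 rank: Int,
                 lambda: Double = 0.01): DenseVector[Double] = {
    val n = newRatings.size
    // Y: n x rank matrix of the item factors the user has interacted with
    val Y = DenseMatrix.zeros[Double](n, rank)
    newRatings.zipWithIndex.foreach { case ((item, _), i) =>
      Y(i, ::) := DenseVector(itemFactors(item)).t
    }
    val r = DenseVector(newRatings.map(_._2).toArray)
    // Solve (Y^T Y + lambda * I) u = Y^T r, the per-user ALS update with items held fixed
    val A = Y.t * Y + DenseMatrix.eye[Double](rank) * lambda
    A \ (Y.t * r)
  }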
Re: Did DataFrames break basic SQLContext?
To answer your first question - yes 1.3.0 did break backward compatibility for the change from SchemaRDD -> DataFrame. SparkSQL was an alpha component so API-breaking changes could happen. It is no longer an alpha component as of 1.3.0 so this will not be the case in the future. Adding toDF should hopefully not be too much of an effort. For the second point - I also have seen these exceptions when upgrading jobs to 1.3.0 - but they don't fail my jobs. Not sure what the cause is; it would be good to understand this. — Sent from Mailbox On Wed, Mar 18, 2015 at 5:22 PM, Justin Pihony justin.pih...@gmail.com wrote: I started to play with 1.3.0 and found that there are a lot of breaking changes. Previously, I could do the following: case class Foo(x: Int) val rdd = sc.parallelize(List(Foo(1))) import sqlContext._ rdd.registerTempTable("foo") Now, I am not able to directly use my RDD object and have it implicitly become a DataFrame. It can be used as a DataFrameHolder, of which I could write: rdd.toDF.registerTempTable("foo") But, that is kind of a pain in comparison. The other problem for me is that I keep getting a SQLException: java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@10393e97, see the next exception for details. This seems to be a dependency on Hive, when previously (1.2.0) there was no such dependency. I can open tickets for these, but wanted to ask here first... maybe I am doing something wrong? Thanks, Justin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Did-DataFrames-break-basic-SQLContext-tp22120.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
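For what it's worth, a short sketch of the 1.3 equivalent of the snippet above; the key change is importing sqlContext.implicits._ and calling toDF explicitly (a SparkContext sc is assumed to be in scope, e.g. in the shell):

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._   // brings the implicit toDF conversion into scope

  case class Foo(x: Int)
  val rdd = sc.parallelize(List(Foo(1)))

  rdd.toDF().registerTempTable("foo")
  sqlContext.sql("SELECT x FROM foo").show()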
Re: Iterative Algorithms with Spark Streaming
MLlib supports streaming linear models: http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression and k-means: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means With an iteration parameter of 1, this amounts to mini-batch SGD where the mini-batch is the Spark Streaming batch. On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar aminn...@verticalscope.com wrote: I wanted to ask a basic question about the types of algorithms that are possible to apply to a DStream with Spark streaming. With Spark it is possible to perform iterative computations on RDDs like in the gradient descent example val points = spark.textFile(...).map(parsePoint).cache() var w = Vector.random(D) // current separating plane for (i <- 1 to ITERATIONS) { val gradient = points.map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } which has a global state w that is updated after each iteration and the updated value is then used in the next iteration. My question is whether this type of algorithm is possible if the points variable was a DStream instead of an RDD? It seems like you could perform the same map as above which would create a gradient DStream and also use updateStateByKey to create a DStream for the w variable. But the problem is that there doesn't seem to be a way to reuse the w DStream inside the map. I don't think that it is possible for DStreams to communicate this way. Am I correct that this is not possible with DStreams or am I missing something? Note: The reason I ask this question is that many machine learning algorithms are trained by stochastic gradient descent. SGD is similar to the above gradient descent algorithm except each iteration is on a new mini-batch of data points rather than the same data points for every iteration. It seems like Spark Streaming provides a natural way to stream in these mini-batches (as RDDs), but if it is not able to keep track of an updating global state variable then I don't think Spark Streaming can be used for SGD. Thanks, Alex
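A hedged sketch of the streaming linear regression approach mentioned above, where the model weights play the role of the global w and are updated once per streaming batch; the path, batch interval and feature count are placeholders:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(sc, Seconds(10))

  // each line is in LabeledPoint.parse format, e.g. "(1.0,[0.5,0.3,0.1])"
  val trainingData = ssc.textFileStream("hdfs:///streaming/train").map(LabeledPoint.parse)

  val numFeatures = 3
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
    .setNumIterations(1)   // one SGD pass per streaming batch, i.e. mini-batch SGD

  model.trainOn(trainingData)   // the weights are carried over and updated on every batch

  ssc.start()
  ssc.awaitTermination()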
Re: Software stack for Recommendation engine with spark mlib
As Sean says, precomputing recommendations is pretty inefficient. Though with 500k items its easy to get all the item vectors in memory so pre-computing is not too bad. Still, since you plan to serve these via a REST service anyway, computing on demand via a serving layer such as Oryx or PredictionIO (or the newly open sourced Seldon.io) is a good option. You can also cache the recommendations quite aggressively - once you compute a user or item top-K list, just stick the result in mem cache / redis / whatever and evict it when you recompute your offline model, or every hour or whatever. — Sent from Mailbox On Sun, Mar 15, 2015 at 3:03 PM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Thanks Sean, your suggestions and the links provided are just what I needed to start off with. On Sun, Mar 15, 2015 at 6:16 PM, Sean Owen so...@cloudera.com wrote: I think you're assuming that you will pre-compute recommendations and store them in Mongo. That's one way to go, with certain tradeoffs. You can precompute offline easily, and serve results at large scale easily, but, you are forced to precompute everything -- lots of wasted effort, not completely up to date. The front-end part of the stack looks right. Spark would do the model building; you'd have to write a process to score recommendations and store the result. Mahout is the same thing, really. 500K items isn't all that large. Your requirements aren't driven just by items though. Number of users and latent features matter too. It matters how often you want to build the model too. I'm guessing you would get away with a handful of modern machines for a problem this size. In a way what you describe reminds me of Wibidata, since it built recommender-like solutions on top of data and results published to a NoSQL store. You might glance at the related OSS project Kiji (http://kiji.org/) for ideas about how to manage the schema. You should have a look at things like Nick's architecture for Graphflow, however it's more concerned with computing recommendation on the fly, and describes a shift from an architecture originally built around something like a NoSQL store: http://spark-summit.org/wp-content/uploads/2014/07/Using-Spark-and-Shark-to-Power-a-Realt-time-Recommendation-and-Customer-Intelligence-Platform-Nick-Pentreath.pdf This is also the kind of ground the oryx project is intended to cover, something I've worked on personally: https://github.com/OryxProject/oryx -- a layer on and around the core model building in Spark + Spark Streaming to provide a whole recommender (for example), down to the REST API. On Sun, Mar 15, 2015 at 10:45 AM, Shashidhar Rao raoshashidhar...@gmail.com wrote: Hi, Can anyone who has developed recommendation engine suggest what could be the possible software stack for such an application. I am basically new to recommendation engine , I just found out Mahout and Spark Mlib which are available . I am thinking the below software stack. 1. The user is going to use Android app. 2. Rest Api sent to app server from the android app to get recommendations. 3. Spark Mlib core engine for recommendation engine 4. MongoDB database backend. I would like to know more on the cluster configuration( how many nodes etc) part of spark for calculating the recommendations for 500,000 items. This items include products for day care etc. Other software stack suggestions would also be very useful.It has to run on multiple vendor machines. Please suggest. Thanks shashi
Re: Spark Release 1.3.0 DataFrame API
I've found people.toDF gives you a data frame (roughly equivalent to the previous Row RDD), And you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable(people) should work — Sent from Mailbox On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell jdavidmitch...@gmail.com wrote: I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation work: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html Specfically: - Inferring the Schema Using Reflection - Programmatically Specifying the Schema Scala 2.11.6 Spark 1.3.0 prebuilt for Hadoop 2.4 and later *Inferring the Schema Using Reflection* scala people.registerTempTable(people) console:31: error: value registerTempTable is not a member of org.apache.spark .rdd.RDD[Person] people.registerTempTable(people) ^ *Programmatically Specifying the Schema* scala val peopleDataFrame = sqlContext.createDataFrame(people, schema) console:41: error: overloaded method value createDataFrame with alternatives: (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spar k.sql.DataFrame and (rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.Dat aFrame and (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],columns: java.util.List[String])org.apache.spark.sql.DataFrame and (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: o rg.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame and (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache .spark.sql.types.StructType)org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.sql.ty pes.StructType) val df = sqlContext.createDataFrame(people, schema) Any help would be appreciated. David
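For the second error above: createDataFrame expects an RDD[Row] (not an RDD[String]) together with the StructType, so the raw strings need to be mapped into Rows first. A rough sketch against the 1.3 API, assuming people is an RDD[String] of comma-separated "name,age" lines and sqlContext is already created:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types.{StructField, StructType, StringType}

  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", StringType, nullable = true)))

  // map each raw line into a Row matching the schema
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

  val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
  peopleDataFrame.registerTempTable("people")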
Re: How do you write Dataframes to elasticsearch
Spark 1.3 is not supported by elasticsearch-hadoop yet but will be very soon: https://github.com/elastic/elasticsearch-hadoop/issues/400 However in the meantime you could use df.toRDD.saveToEs - though you may have to manipulate the Row object perhaps to extract fields, not sure if it will serialize directly to ES JSON... — Sent from Mailbox On Wed, Mar 25, 2015 at 2:07 PM, yamanoj manoj.per...@gmail.com wrote: It seems that elasticsearch-spark_2.10 currently not supporting spart 1.3. Could you tell me if there is an alternative way to save Dataframes to elasticsearch? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-do-you-write-Dataframes-to-elasticsearch-tp3.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
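In the meantime, a sketch of the RDD route suggested above using the elasticsearch-spark RDD API; it assumes df is the DataFrame, that es.nodes is already set in the SparkConf, that the columns are simple types, and the index/type name is a placeholder:

  import org.elasticsearch.spark._   // from the elasticsearch-hadoop / elasticsearch-spark artifact

  val fields = df.columns   // column names, captured on the driver
  // turn each Row into a Map so it serializes cleanly to ES JSON
  val docs = df.rdd.map { row =>
    fields.zip(row.toSeq).toMap   // field name -> value
  }
  docs.saveToEs("myindex/mytype")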
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems
Ah I see now you are trying to use a spark 1.2 cluster - you will need to be running spark 1.3 on your EC2 cluster in order to run programs built against spark 1.3. You will need to terminate and restart your cluster with spark 1.3 — Sent from Mailbox On Wed, Mar 25, 2015 at 6:39 PM, roni roni.epi...@gmail.com wrote: Even if H2o and ADA are dependent on 1.2.1 , it should be backword compatible, right? So using 1.3 should not break them. And the code is not using the classes from those libs. I tried sbt clean compile .. same errror Thanks _R On Wed, Mar 25, 2015 at 9:26 AM, Nick Pentreath nick.pentre...@gmail.com wrote: What version of Spark do the other dependencies rely on (Adam and H2O?) - that could be it Or try sbt clean compile — Sent from Mailbox https://www.dropbox.com/mailbox On Wed, Mar 25, 2015 at 5:58 PM, roni roni.epi...@gmail.com wrote: I have a EC2 cluster created using spark version 1.2.1. And I have a SBT project . Now I want to upgrade to spark 1.3 and use the new features. Below are issues . Sorry for the long post. Appreciate your help. Thanks -Roni Question - Do I have to create a new cluster using spark 1.3? Here is what I did - In my SBT file I changed to - libraryDependencies += org.apache.spark %% spark-core % 1.3.0 But then I started getting compilation error. along with Here are some of the libraries that were evicted: [warn] * org.apache.spark:spark-core_2.10:1.2.0 - 1.3.0 [warn] * org.apache.hadoop:hadoop-client:(2.5.0-cdh5.2.0, 2.2.0) - 2.6.0 [warn] Run 'evicted' to see detailed eviction warnings constructor cannot be instantiated to expected type; [error] found : (T1, T2) [error] required: org.apache.spark.sql.catalyst.expressions.Row [error] val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} [error] ^ Here is my SBT and code -- SBT - version := 1.0 scalaVersion := 2.10.4 resolvers += Sonatype OSS Snapshots at https://oss.sonatype.org/content/repositories/snapshots;; resolvers += Maven Repo1 at https://repo1.maven.org/maven2;; resolvers += Maven Repo at https://s3.amazonaws.com/h2o-release/h2o-dev/master/1056/maven/repo/;; /* Dependencies - %% appends Scala version to artifactId */ libraryDependencies += org.apache.hadoop % hadoop-client % 2.6.0 libraryDependencies += org.apache.spark %% spark-core % 1.3.0 libraryDependencies += org.bdgenomics.adam % adam-core % 0.16.0 libraryDependencies += ai.h2o % sparkling-water-core_2.10 % 0.2.10 CODE -- import org.apache.spark.{SparkConf, SparkContext} case class KmerIntesect(kmer: String, kCount: Int, fileName: String) object preDefKmerIntersection { def main(args: Array[String]) { val sparkConf = new SparkConf().setAppName(preDefKmer-intersect) val sc = new SparkContext(sparkConf) import sqlContext.createSchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc) val bedFile = sc.textFile(s3n://a/b/c,40) val hgfasta = sc.textFile(hdfs://a/b/c,40) val hgPair = hgfasta.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val filtered = hgPair.filter(kv = kv._2 == 1) val bedPair = bedFile.map(_.split (,)).map(a= (a(0), a(1).trim().toInt)) val joinRDD = bedPair.join(filtered) val ty = joinRDD.map{case(word, (file1Counts, file2Counts)) = KmerIntesect(word, file1Counts,xyz)} ty.registerTempTable(KmerIntesect) ty.saveAsParquetFile(hdfs://x/y/z/kmerIntersect.parquet) } }
Re: iPython Notebook + Spark + Accumulo -- best practice?
I'm guessing the Accumulo Key and Value classes are not serializable, so you would need to do something like val rdd = sc.newAPIHadoopRDD(...).map { case (key, value) = (extractScalaType(key), extractScalaType(value)) } Where 'extractScalaType converts the key or Value to a standard Scala type or case class or whatever - basically extracts the data from the Key or Value in a form usable in Scala — Sent from Mailbox On Thu, Mar 26, 2015 at 8:59 PM, Russ Weeks rwe...@newbrightidea.com wrote: Hi, David, This is the code that I use to create a JavaPairRDD from an Accumulo table: JavaSparkContext sc = new JavaSparkContext(conf); Job hadoopJob = Job.getInstance(conf,TestSparkJob); job.setInputFormatClass(AccumuloInputFormat.class); AccumuloInputFormat.setZooKeeperInstance(job, conf.get(ZOOKEEPER_INSTANCE_NAME, conf.get(ZOOKEEPER_HOSTS) ); AccumuloInputFormat.setConnectorInfo(job, conf.get(ACCUMULO_AGILE_USERNAME), new PasswordToken(conf.get(ACCUMULO_AGILE_PASSWORD)) ); AccumuloInputFormat.setInputTableName(job, conf.get(ACCUMULO_TABLE_NAME)); AccumuloInputFormat.setScanAuthorizations(job, auths); JavaPairRDDKey, Value values = sc.newAPIHadoopRDD(hadoopJob.getConfiguration(), AccumuloInputFormat.class, Key.class, Value.class); Key.class and Value.class are from org.apache.accumulo.core.data. I use a WholeRowIterator so that the Value is actually an encoded representation of an entire logical row; it's a useful convenience if you can be sure that your rows always fit in memory. I haven't tested it since Spark 1.0.1 but I doubt anything important has changed. Regards, -Russ On Thu, Mar 26, 2015 at 11:41 AM, David Holiday dav...@annaisystems.com wrote: * progress!* i was able to figure out why the 'input INFO not set' error was occurring. the eagle-eyed among you will no doubt see the following code is missing a closing '(' AbstractInputFormat.setConnectorInfo(jobConf, root, new PasswordToken(password) as I'm doing this in spark-notebook, I'd been clicking the execute button and moving on because I wasn't seeing an error. what I forgot was that notebook is going to do what spark-shell will do when you leave off a closing ')' -- *it will wait forever for you to add it*. so the error was the result of the 'setConnectorInfo' method never getting executed. unfortunately, I'm still unable to shove the accumulo table data into an RDD that's useable to me. when I execute rddX.count I get back res15: Long = 1 which is the correct response - there are 10,000 rows of data in the table I pointed to. however, when I try to grab the first element of data thusly: rddX.first I get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key any thoughts on where to go from here? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com www.AnnaiSystems.com On Mar 26, 2015, at 8:35 AM, David Holiday dav...@annaisystems.com wrote: hi Nick Unfortunately the Accumulo docs are woefully inadequate, and in some places, flat wrong. I'm not sure if this is a case where the docs are 'flat wrong', or if there's some wrinke with spark-notebook in the mix that's messing everything up. 
I've been working with some people on stack overflow on this same issue (including one of the people from the spark-notebook team): http://stackoverflow.com/questions/29244530/how-do-i-create-a-spark-rdd-from-accumulo-1-6-in-spark-notebook?noredirect=1#comment46755938_29244530 if you click the link you can see the entire thread of code, responses from notebook, etc. I'm going to try invoking the same techniques both from within a stand-alone scala problem and from the shell itself to see if I can get some traction. I'll report back when I have more data. cheers (and thx!) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com GetFileAttachment.jpg www.AnnaiSystems.com http://www.annaisystems.com/ On Mar 25, 2015, at 11:43 PM, Nick Pentreath nick.pentre...@gmail.com wrote: From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting these config options first and then call newAPIHadoopRDD? On Thu, Mar 26, 2015 at 2:34 AM, David Holiday dav...@annaisystems.com wrote: hi Irfan, thanks for getting back to me - i'll try the accumulo list to be sure. what is the normal use case for spark though? I'm surprised that hooking it into something as common and popular as accumulo isn't more of an every-day task. DAVID HOLIDAY Software Engineer 760 607 3300
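A small sketch of the "extract to plain Scala types" idea from the earlier reply, which avoids the non-serializable Key in the task result by mapping each entry to a tuple of Strings before calling first/collect; rddX here stands for the RDD returned by newAPIHadoopRDD above, and the choice of fields is illustrative:

  import org.apache.accumulo.core.data.{Key, Value}

  val rows = rddX.map { case (key: Key, value: Value) =>
    // convert the Accumulo types into plain Strings so the result is serializable
    (key.getRow.toString,
     key.getColumnFamily.toString,
     key.getColumnQualifier.toString,
     new String(value.get(), "UTF-8"))
  }

  rows.first()   // safe now: the tuple only contains Strings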
Re: hadoop input/output format advanced control
You can indeed override the Hadoop configuration at a per-RDD level - though it is a little more verbose, as in the below example, and you need to effectively make a copy of the hadoop Configuration: val thisRDDConf = new Configuration(sc.hadoopConfiguration) thisRDDConf.set(mapred.min.split.size, 5) val rdd = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text], thisRDDConf ) println(rdd.partitions.size) val rdd2 = sc.newAPIHadoopFile(path, classOf[SequenceFileInputFormat[IntWritable, Text]], classOf[IntWritable], classOf[Text] ) println(rdd2.partitions.size) For example, if I run the above on the following directory (some files I have lying around): -rw-r--r-- 1 Nick staff 0B Jul 11 2014 _SUCCESS -rw-r--r-- 1 Nick staff 291M Sep 16 2014 part-0 -rw-r--r-- 1 Nick staff 227M Sep 16 2014 part-1 -rw-r--r-- 1 Nick staff 370M Sep 16 2014 part-2 -rw-r--r-- 1 Nick staff 244M Sep 16 2014 part-3 -rw-r--r-- 1 Nick staff 240M Sep 16 2014 part-4 I get output: 15/03/24 20:43:12 INFO FileInputFormat: Total input paths to process : 5 *5* ... and then for the second RDD: 15/03/24 20:43:12 INFO SparkContext: Created broadcast 1 from newAPIHadoopFile at TestHash.scala:41 *45* As expected. Though a more succinct way of passing in those conf options would be nice - but this should get you what you need. On Mon, Mar 23, 2015 at 10:36 PM, Koert Kuipers ko...@tresata.com wrote: currently its pretty hard to control the Hadoop Input/Output formats used in Spark. The conventions seems to be to add extra parameters to all methods and then somewhere deep inside the code (for example in PairRDDFunctions.saveAsHadoopFile) all these parameters get translated into settings on the Hadoop Configuration object. for example for compression i see codec: Option[Class[_ : CompressionCodec]] = None added to a bunch of methods. how scalable is this solution really? for example i need to read from a hadoop dataset and i dont want the input (part) files to get split up. the way to do this is to set mapred.min.split.size. now i dont want to set this at the level of the SparkContext (which can be done), since i dont want it to apply to input formats in general. i want it to apply to just this one specific input dataset i need to read. which leaves me with no options currently. i could go add yet another input parameter to all the methods (SparkContext.textFile, SparkContext.hadoopFile, SparkContext.objectFile, etc.). but that seems ineffective. why can we not expose a Map[String, String] or some other generic way to manipulate settings for hadoop input/output formats? it would require adding one more parameter to all methods to deal with hadoop input/output formats, but after that its done. one parameter to rule them all then i could do: val x = sc.textFile(/some/path, formatSettings = Map(mapred.min.split.size - 12345)) or rdd.saveAsTextFile(/some/path, formatSettings = Map(mapred.output.compress - true, mapred.output.compression.codec - somecodec))
Re: iPython Notebook + Spark + Accumulo -- best practice?
From a quick look at this link - http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it seems you need to call some static methods on AccumuloInputFormat in order to set the auth, table, and range settings. Try setting these config options first and then call newAPIHadoopRDD? On Thu, Mar 26, 2015 at 2:34 AM, David Holiday dav...@annaisystems.com wrote: hi Irfan, thanks for getting back to me - i'll try the accumulo list to be sure. what is the normal use case for spark though? I'm surprised that hooking it into something as common and popular as accumulo isn't more of an every-day task. DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com www.AnnaiSystems.com On Mar 25, 2015, at 5:27 PM, Irfan Ahmad ir...@cloudphysics.com wrote: Hmmm this seems very accumulo-specific, doesn't it? Not sure how to help with that. *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Tue, Mar 24, 2015 at 4:09 PM, David Holiday dav...@annaisystems.com wrote: hi all, got a vagrant image with spark notebook, spark, accumulo, and hadoop all running. from notebook I can manually create a scanner and pull test data from a table I created using one of the accumulo examples: val instanceNameS = accumuloval zooServersS = localhost:2181val instance: Instance = new ZooKeeperInstance(instanceNameS, zooServersS)val connector: Connector = instance.getConnector( root, new PasswordToken(password))val auths = new Authorizations(exampleVis)val scanner = connector.createScanner(batchtest1, auths) scanner.setRange(new Range(row_00, row_10)) for(entry: Entry[Key, Value] - scanner) { println(entry.getKey + is + entry.getValue)} will give the first ten rows of table data. when I try to create the RDD thusly: val rdd2 = sparkContext.newAPIHadoopRDD ( new Configuration(), classOf[org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat], classOf[org.apache.accumulo.core.data.Key], classOf[org.apache.accumulo.core.data.Value] ) I get an RDD returned to me that I can't do much with due to the following error: java.io.IOException: Input info has not been set. at org.apache.accumulo.core.client.mapreduce.lib.impl.InputConfigurator.validateOptions(InputConfigurator.java:630) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.validateOptions(AbstractInputFormat.java:343) at org.apache.accumulo.core.client.mapreduce.AbstractInputFormat.getSplits(AbstractInputFormat.java:538) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:98) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:220) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367) at org.apache.spark.rdd.RDD.count(RDD.scala:927) which totally makes sense in light of the fact that I haven't specified any parameters as to which table to connect with, what the auths are, etc. so my question is: what do I need to do from here to get those first ten rows of table data into my RDD? 
DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com GetFileAttachment.jpg www.AnnaiSystems.com http://www.annaisystems.com/ On Mar 19, 2015, at 11:25 AM, David Holiday dav...@annaisystems.com wrote: kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com GetFileAttachment.jpg www.AnnaiSystems.com http://www.annaisystems.com/ On Mar 19, 2015, at 10:59 AM, Irfan Ahmad ir...@cloudphysics.com wrote: Once you setup spark-notebook, it'll handle the submits for interactive work. Non-interactive is not handled by it. For that spark-kernel could be used. Give it a shot ... it only takes 5 minutes to get it running in local-mode. *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics* http://www.cloudphysics.com/ Best of VMworld Finalist Best Cloud Management Award NetworkWorld 10 Startups to Watch EMA Most Notable Vendor On Thu, Mar 19, 2015 at 9:51 AM, David Holiday dav...@annaisystems.com wrote: hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com broo...@annaisystems.com GetFileAttachment.jpg www.AnnaiSystems.com http://www.annaisystems.com/ On Mar 19, 2015, at
Re: StackOverflow Problem with 1.3 mllib ALS
Fair enough but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote: Thanks Xiangrui, I used 80 iterations to demonstrates the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered stackoverflow problem. After some digging, I realized that when the iteration ~35, I will get overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
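A sketch of the checkpointing fix Xiangrui refers to (setCheckpointInterval is available from 1.3.1 / master); a checkpoint directory has to be set on the SparkContext for it to take effect, and the parameter values and path below are placeholders, with ratings assumed to be an RDD[Rating]:

  import org.apache.spark.mllib.recommendation.ALS

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  val model = new ALS()
    .setRank(20)
    .setIterations(80)
    .setLambda(0.01)
    .setCheckpointInterval(10)   // truncate the lineage every 10 iterations to avoid stack overflow
    .run(ratings)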
Re: ElasticSearch for Spark times out
Is your ES cluster reachable from your Spark cluster via network / firewall? Can you run the same query from the spark master and slave nodes via curl / one of the other clients? Seems odd that GC issues would be a problem from the scan but not when running query from a browser plugin... Sounds like it could be a network issue. — Sent from Mailbox On Thu, Apr 23, 2015 at 5:11 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, If you get ES response back in 1-5 seconds that's pretty slow. Are these ES aggregation queries? Costin may be right about GC possibly causing timeouts. SPM http://sematext.com/spm/ can give you all Spark and all key Elasticsearch metrics, including various JVM metrics. If the problem is GC, you'll see it. If you monitor both Spark side and ES side, you should be able to find some correlation with SPM. Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Wed, Apr 22, 2015 at 5:43 PM, Costin Leau costin.l...@gmail.com wrote: Hi, First off, for Elasticsearch questions is worth pinging the Elastic mailing list as that is closer monitored than this one. Back to your question, Jeetendra is right that the exception indicates nodata is flowing back to the es-connector and Spark. The default is 1m [1] which should be more than enough for a typical scenario. As a side note the scroll size is 50 per tasks (so 150 suggests 3 tasks). Once the query is made, scrolling the document is fast - likely there's something else at hand that causes the connection to timeout. In such cases, you can enable logging on the REST package and see what type of data transfer occurs between ES and Spark. Do note that if a GC occurs, that can freeze Elastic (or Spark) which might trigger the timeout. Consider monitoring Elasticsearch during the query and see whether anything jumps - in particular the memory pressure. Hope this helps, [1] http://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html#_network On 4/22/15 10:44 PM, Adrian Mocanu wrote: Hi Thanks for the help. My ES is up. Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don’t think my ES is that slow but it’s possible that ES is taking too long to find the data. What I see happening is that it uses scroll to get the data from ES; about 150 items at a time.Usual delay when I perform the same query from a browser plugin ranges from 1-5sec. Thanks *From:*Jeetendra Gangele [mailto:gangele...@gmail.com] *Sent:* April 22, 2015 3:09 PM *To:* Adrian Mocanu *Cc:* u...@spark.incubator.apache.org *Subject:* Re: ElasticSearch for Spark times out Basically ready timeout means hat no data arrived within the specified receive timeout period. Few thing I would suggest 1.are your ES cluster Up and running? 2. if 1 is yes then reduce the size of the Index make it few kbps and then test? On 23 April 2015 at 00:19, Adrian Mocanu amoc...@verticalscope.com mailto:amoc...@verticalscope.com wrote: Hi I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD. How can I keep the connection alive (why doesn’t it? Bug?) 
Here’s the exception I get: org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:86) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.serialization.ParsingUtils.doSeekToken(ParsingUtils.java:70) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.serialization.ParsingUtils.seek(ParsingUtils.java:58) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:149) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:102) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:81) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:314) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
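If it does turn out to be slow scans rather than a network problem, the es-hadoop read timeout and scroll size mentioned above can be tuned from the SparkConf; a sketch with placeholder values (see the configuration page linked as [1] above for the full list of settings):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("es-read")
    .set("es.nodes", "es-host:9200")
    .set("es.http.timeout", "5m")    // default is 1m
    .set("es.scroll.size", "200")    // default is 50 documents per scroll request, per task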
Re: MLlib -Collaborative Filtering
You will have to get the two user factor vectors from the ALS model and compute the cosine similarity between them. You can do this using Breeze vectors: import breeze.linalg._ val user1 = new DenseVector[Double](userFactors.lookup(user1).head) val user2 = new DenseVector[Double](userFactors.lookup(user2).head) val sim = user1.t * user2 / (norm(user1)* norm(user2)) There is no built-in way currently to compute user or item similarities, though there is a PR working on it: https://github.com/apache/spark/pull/3536 On Sun, Apr 19, 2015 at 7:29 PM, Christian S. Perone christian.per...@gmail.com wrote: The easiest way to do that is to use a similarity metric between the different user factors. On Sat, Apr 18, 2015 at 7:49 AM, riginos samarasrigi...@gmail.com wrote: Is there any way that i can see the similarity table of 2 users in that algorithm? by that i mean the similarity between 2 users -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22553.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Blog http://blog.christianperone.com | Github https://github.com/perone | Twitter https://twitter.com/tarantulae Forgive, O Lord, my little jokes on Thee, and I'll forgive Thy great big joke on me.
Re: solr in spark
I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say Elasticsearch is not a good option ...? ES absolutely supports full-text search and not just filtering and grouping (in fact its original purpose was and still is text search, though filtering, grouping and aggregation are heavily used). http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele gangele...@gmail.com wrote: Has anyone tried using Solr inside Spark? Below is the project describing it: https://github.com/LucidWorks/spark-solr. I have a requirement in which I want to index 20 million company names and then search as and when new data comes in. The output should be a list of companies matching the query. Spark has inbuilt Elasticsearch support, but for this purpose Elasticsearch is not a good option since this is totally a text search problem? Elasticsearch is good for filtering and grouping. Has anybody used Solr inside Spark? Regards jeetendra
Re: solr in spark
Depends on your use case and search volume. Typically you'd have a dedicated ES cluster if your app is doing a lot of real time indexing and search. If it's only for spark integration then you could colocate ES and spark — Sent from Mailbox On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for reply. Elastic search index will be within my Cluster? or I need the separate host the elastic search? On 28 April 2015 at 22:03, Nick Pentreath nick.pentre...@gmail.com wrote: I haven't used Solr for a long time, and haven't used Solr in Spark. However, why do you say Elasticsearch is not a good option ...? ES absolutely supports full-text search and not just filtering and grouping (in fact it's original purpose was and still is text search, though filtering, grouping and aggregation are heavily used). http://www.elastic.co/guide/en/elasticsearch/guide/master/full-text-search.html On Tue, Apr 28, 2015 at 6:27 PM, Jeetendra Gangele gangele...@gmail.com wrote: Does anyone tried using solr inside spark? below is the project describing it. https://github.com/LucidWorks/spark-solr. I have a requirement in which I want to index 20 millions companies name and then search as and when new data comes in. the output should be list of companies matching the query. Spark has inbuilt elastic search but for this purpose Elastic search is not a good option since this is totally text search problem? Elastic search is good for filtering and grouping. Does any body used solr inside spark? Regards jeetendra
Re: Content based filtering
Content based filtering is a pretty broad term - do you have any particular approach in mind? MLLib does not have any purely content-based methods. Your main alternative is ALS collaborative filtering. However, using a system like Oryx / PredictionIO / elasticsearch etc you can combine factor-based collaborative filtering with content-based pre- and post-filtering steps (eg filter recommendations by geolocation, price, category and so on). — Sent from Mailbox On Tue, May 12, 2015 at 1:45 PM, Yasemin Kaya godo...@gmail.com wrote: Hi, is Content based filtering available for Spark in Mllib? If it isn't , what can I use as an alternative? Thank you. Have a nice day yasemin -- hiç ender hiç
Re: Passing Elastic Search Mappings in Spark Conf
If you want to specify mapping you must first create the mappings for your index types before indexing. As far as I know there is no way to specify this via ES-hadoop. But it's best practice to explicitly create mappings prior to indexing, or to use index templates when dynamically creating indexes. — Sent from Mailbox On Thu, Apr 16, 2015 at 1:14 AM, Deepak Subhramanian deepak.subhraman...@gmail.com wrote: Hi, Is there a way to pass the mapping to define a field as not analyzed with es-spark settings. I am just wondering if I can set the mapping type for a field as not analyzed using the set function in spark conf as similar to the other es settings. val sconf = new SparkConf() .setMaster(local[1]) .setAppName(Load Data To ES) .set(spark.ui.port, 4141) .set(es.index.auto.create, true) .set(es.net.http.auth.user, es_admin) .set(es.index.auto.create, true) .set(es.mapping.names, CREATED_DATE:@timestamp) Thanks, Deepak Subhramanian - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: When querying ElasticSearch, score is 0
ES-hadoop uses a scan scroll search to efficiently retrieve large result sets. Scores are not tracked in a scan and sorting is not supported hence 0 scores. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan — Sent from Mailbox On Thu, Apr 16, 2015 at 10:46 PM, Andrejs Abele andrejs.ab...@insight-centre.org wrote: Hi, I have data in my ElasticSearch server, when I query it using rest interface, I get results and score for each result, but when I run the same query in spark using ElasticSearch API, I get results and meta data, but the score is shown 0 for each record. My configuration is ... val conf = new SparkConf() .setMaster(local[6]) .setAppName(DBpedia to ElasticSearch) .set(es.index.auto.create, true) .set(es.field.read.empty.as.null,true) .set(es.read.metadata,true) ... val sc = new SparkContext(conf) val test= Map(query-{\n\query\:{\n \fuzzy_like_this\ : {\n \fields\ : [\label\],\n \like_text\ : \102nd Ohio Infantry\ }\n } \n}) val mYRDD = sc.esRDD(dbpedia/docs,test.get(query).get) Sample output: Map(id - http://dbpedia.org/resource/Alert,_Ohio;, label - Alert, Ohio, category - Unincorporated communities in Ohio, abstract - Alert is an unincorporated community in southern Morgan Township, Butler County, Ohio, in the United States. It is located about ten miles southwest of Hamilton on Howards Creek, a tributary of the Great Miami River in section 28 of R1ET3N of the Congress Lands. It is three miles west of Shandon and two miles south of Okeana., _metadata - Map(_index - dbpedia, _type - docs, _id - AUy5aQs7895C6HE5GmG4, _score - 0.0)) As you can see _score is 0. Would appreciate any help, Cheers, Andrejs
Re: MLlib -Collaborative Filtering
What do you mean by similarity table of 2 users? Do you mean the similarity between 2 users? — Sent from Mailbox On Sat, Apr 18, 2015 at 11:09 AM, riginos samarasrigi...@gmail.com wrote: Is there any way that i can see the similarity table of 2 users in that algorithm? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22552.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data
There is no difference - textFile calls hadoopFile with a TextInputFormat, and maps each value to a String. — Sent from Mailbox On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha puneet.ku...@pubmatic.com wrote: Hi, Is there any difference between textFile vs hadoopFile (TextInputFormat) when data is present in HDFS? Will there be any performance gain that can be observed? Puneet Kumar Ojha Data Architect | PubMatic http://www.pubmatic.com/
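Concretely, sc.textFile(path) is just a thin wrapper, roughly equivalent to the following (so there is no performance difference to be gained); the path is a placeholder:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.TextInputFormat

  val path = "hdfs:///data/input.txt"
  val lines = sc.hadoopFile(path, classOf[TextInputFormat],
                            classOf[LongWritable], classOf[Text])
                .map { case (_, text) => text.toString }   // keep only the line contents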
Re: Migrating from Spark 0.8.0 to Spark 1.3.0
It shouldn't be too bad - pertinent changes migration notes are here: http://spark.apache.org/docs/1.0.0/programming-guide.html#migrating-from-pre-10-versions-of-spark for pre-1.0 and here: http://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13 for SparkSQL pre-1.3 Since you aren't using SparkSQL the 2nd link is probably not useful. Generally you should find very few changes in the core API but things like MLlib would have changed a fair bit - though again the API should have been relatively stable. Your biggest change is probably going to be running jobs through spark-submit rather than spark-class etc: http://spark.apache.org/docs/latest/submitting-applications.html — Sent from Mailbox On Sat, Apr 4, 2015 at 1:11 AM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Hi, Are there any tutorials that explains all the changelogs between Spark 0.8.0 and Spark 1.3.0 and how can we approach this issue.
Re: Spark and Google Cloud Storage
I believe it is available here: https://cloud.google.com/hadoop/google-cloud-storage-connector 2015-06-18 15:31 GMT+02:00 Klaus Schaefers klaus.schaef...@ligatus.com: Hi, is there a kind adapter to use GoogleCloudStorage with Spark? Cheers, Klaus -- -- Klaus Schaefers Senior Optimization Manager Ligatus GmbH Hohenstaufenring 30-32 D-50674 Köln Tel.: +49 (0) 221 / 56939 -784 Fax: +49 (0) 221 / 56 939 - 599 E-Mail: klaus.schaef...@ligatus.com Web: www.ligatus.de HRB Köln 56003 Geschäftsführung: Dipl.-Kaufmann Lars Hasselbach, Dipl.-Kaufmann Klaus Ludemann, Dipl.-Wirtschaftsingenieur Arne Wolter
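For reference, hooking the connector up from Spark roughly looks like the sketch below. The Hadoop property names are taken from the connector documentation and can differ between versions, and the project id, key file and bucket are placeholders:
    // Hedged sketch: exact property names depend on the GCS connector version.
    sc.hadoopConfiguration.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    sc.hadoopConfiguration.set("fs.gs.project.id", "my-gcp-project")
    sc.hadoopConfiguration.set("google.cloud.auth.service.account.enable", "true")
    sc.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "/path/to/key.json")

    val lines = sc.textFile("gs://my-bucket/input/*")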
Re: Spark Titan
Something like this works (or at least worked with titan 0.4 back when I was using it): val graph = sc.newAPIHadoopRDD( configuration, fClass = classOf[TitanHBaseInputFormat], kClass = classOf[NullWritable], vClass = classOf[FaunusVertex]) graph.flatMap { vertex => val edges = vertex.getEdges(Direction.OUT).filter(e => e.getLabel == ...) edges.map { edge => (...) } } Note that FaunusVertex is not Serializable so you'll need to extract the properties (or say a JSON representation) of your vertices in the first map or flatMap operation (or extract your edges and properties). On Sun, Jun 21, 2015 at 6:57 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Have a look at http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could use those Input/Output formats with newAPIHadoopRDD api call. Thanks Best Regards On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, How to connect TItan database from Spark? Any out of the box api's available? Regards, Rajesh
Re: Velox Model Server
Ok My view is with only 100k items, you are better off serving in-memory for items vectors. i.e. store all item vectors in memory, and compute user * item score on-demand. In most applications only a small proportion of users are active, so really you don't need all 10m user vectors in memory. They could be looked up from a K-V store and have an LRU cache in memory for say 1m of those. Optionally also update them as feedback comes in. As far as I can see, this is pretty much what velox does except it partitions all user vectors across nodes to scale. Oryx does almost the same but Oryx1 kept all user and item vectors in memory (though I am not sure about whether Oryx2 still stores all user and item vectors in memory or partitions in some way). Deb, we are using a custom Akka-based model server (with Scalatra frontend). It is more focused on many small models in-memory (largest of these is around 5m user vectors, 100k item vectors, with factor size 20-50). We use Akka cluster sharding to allow scale-out across nodes if required. We have a few hundred models comfortably powered by m3.xlarge AWS instances. Using floats you could probably have all of your factors in memory on one 64GB machine (depending on how many models you have). Our solution is not that generic and a little hacked-together - but I'd be happy to chat offline about sharing what we've done. I think it still has a basic client to the Spark JobServer which would allow triggering re-computation jobs periodically. We currently just run batch re-computation and reload factors from S3 periodically. We then use Elasticsearch to post-filter results and blend content-based stuff - which I think might be more efficient than SparkSQL for this particular purpose. On Wed, Jun 24, 2015 at 8:59 AM, Debasish Das debasish.da...@gmail.com wrote: Model sizes are 10m x rank, 100k x rank range. For recommendation/topic modeling I can run batch recommendAll and then keep serving the model using a distributed cache but then I can't incorporate per user model re-predict if user feedback is making the current topk stale. I have to wait for next batch refresh which might be 1 hr away. spark job server + spark sql can get me fresh updates but each time running a predict might be slow. I am guessing the better idea might be to start with batch recommendAll and then update the per user model if it get stale but that needs acess to the key value store and the model over a API like spark job server. I am running experiments with job server. In general it will be nice if my key value store and model are both managed by same akka based API. Yes sparksql is to filter/boost recommendation results using business logic like user demography for example.. On Jun 23, 2015 2:07 AM, Sean Owen so...@cloudera.com wrote: Yes, and typically needs are 100ms. Now imagine even 10 concurrent requests. My experience has been that this approach won't nearly scale. The best you could probably do is async mini-batch near-real-time scoring, pushing results to some store for retrieval, which could be entirely suitable for your use case. On Tue, Jun 23, 2015 at 8:52 AM, Nick Pentreath nick.pentre...@gmail.com wrote: If your recommendation needs are real-time (1s) I am not sure job server and computing the refs with spark will do the trick (though those new BLAS-based methods may have given sufficient speed up).
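As a minimal sketch of the "item vectors in memory, score on demand" idea (plain Scala, no particular serving framework; the K-V / LRU lookup that produces the user vector is elided, and float factors keep the memory footprint down):
    case class Scored(itemId: Int, score: Float)

    // All item factors held in memory; userVec would come from a K-V store / LRU cache.
    def topK(userVec: Array[Float], itemVecs: Map[Int, Array[Float]], k: Int): Seq[Scored] = {
      def dot(a: Array[Float], b: Array[Float]): Float = {
        var s = 0.0f; var i = 0
        while (i < a.length) { s += a(i) * b(i); i += 1 }
        s
      }
      itemVecs.toSeq
        .map { case (id, vec) => Scored(id, dot(userVec, vec)) }
        .sortBy(-_.score)
        .take(k)
    }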
Re: Matrix Multiplication and mllib.recommendation
Yup, numpy calls into BLAS for matrix multiply. Sent from my iPad On 18 Jun 2015, at 8:54 PM, Ayman Farahat ayman.fara...@yahoo.com wrote: Thanks all for the help. It turned out that using the bumpy matrix multiplication made a huge difference in performance. I suspect that Numpy already uses BLAS optimized code. Here is Python code #This is where i load and directly test the predictions myModel = MatrixFactorizationModel.load(sc, FlurryModelPath) m1 = myModel.productFeatures().sample(False, 1.00) m2 = m1.map(lambda (user,feature) : feature).collect() m3 = matrix(m2).transpose() pf = sc.broadcast(m3) uf = myModel.userFeatures() f1 = uf.map(lambda (userID, features): (userID, squeeze(asarray(matrix(array(features)) * pf.value dog = f1.count() On Jun 18, 2015, at 8:42 AM, Debasish Das debasish.da...@gmail.com wrote: Also in my experiments, it's much faster to blocked BLAS through cartesian rather than doing sc.union. Here are the details on the experiments: https://issues.apache.org/jira/browse/SPARK-4823 On Thu, Jun 18, 2015 at 8:40 AM, Debasish Das debasish.da...@gmail.com wrote: Also not sure how threading helps here because Spark puts a partition to each core. On each core may be there are multiple threads if you are using intel hyperthreading but I will let Spark handle the threading. On Thu, Jun 18, 2015 at 8:38 AM, Debasish Das debasish.da...@gmail.com wrote: We added SPARK-3066 for this. In 1.4 you should get the code to do BLAS dgemm based calculation. On Thu, Jun 18, 2015 at 8:20 AM, Ayman Farahat ayman.fara...@yahoo.com.invalid wrote: Thanks Sabarish and Nick Would you happen to have some code snippets that you can share. Best Ayman On Jun 17, 2015, at 10:35 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote: Nick is right. I too have implemented this way and it works just fine. In my case, there can be even more products. You simply broadcast blocks of products to userFeatures.mapPartitions() and BLAS multiply in there to get recommendations. In my case 10K products form one block. Note that you would then have to union your recommendations. And if there lots of product blocks, you might also want to checkpoint once every few times. Regards Sab On Thu, Jun 18, 2015 at 10:43 AM, Nick Pentreath nick.pentre...@gmail.com wrote: One issue is that you broadcast the product vectors and then do a dot product one-by-one with the user vector. You should try forming a matrix of the item vectors and doing the dot product as a matrix-vector multiply which will make things a lot faster. Another optimisation that is avalailable on 1.4 is a recommendProducts method that blockifies the factors to make use of level 3 BLAS (ie matrix-matrix multiply). I am not sure if this is available in The Python api yet. But you can do a version yourself by using mapPartitions over user factors, blocking the factors into sub-matrices and doing matrix multiply with item factor matrix to get scores on a block-by-block basis. Also as Ilya says more parallelism can help. I don't think it's so necessary to do LSH with 30,000 items. — Sent from Mailbox On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Actually talk about this exact thing in a blog post here http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/. Keep in mind, you're actually doing a ton of math. Even with proper caching and use of broadcast variables this will take a while defending on the size of your cluster. 
To get real results you may want to look into locality sensitive hashing to limit your search space and definitely look into spinning up multiple threads to process your product features in parallel to increase resource utilization on the cluster. Thank you, Ilya Ganelin -Original Message- From: afarahat [ayman.fara...@yahoo.com] Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time To: user@spark.apache.org Subject: Matrix Multiplication and mllib.recommendation Hello; I am trying to get predictions after running the ALS model. The model works fine. In the prediction/recommendation , I have about 30 ,000 products and 90 Millions users. When i try the predict all it fails. I have been trying to formulate the problem as a Matrix multiplication where I first get the product features, broadcast them and then do a dot product. Its still very slow. Any reason why here is a sample code def doMultiply(x): a = [] #multiply by mylen = len(pf.value) for i in range(mylen) : myprod = numpy.dot(x,pf.value[i][1]) a.append(myprod) return a myModel = MatrixFactorizationModel.load(sc, FlurryModelPath) #I need to select which products to broadcast but lets try all m1
Re: Velox Model Server
How large are your models? Spark job server does allow synchronous job execution and with a warm long-lived context it will be quite fast - but still in the order of a second or a few seconds usually (depending on model size - for very large models possibly quite a lot more than that). What are your use cases for SQL during recommendation? Filtering? If your recommendation needs are real-time (1s) I am not sure job server and computing the refs with spark will do the trick (though those new BLAS-based methods may have given sufficient speed up). — Sent from Mailbox On Mon, Jun 22, 2015 at 11:17 PM, Debasish Das debasish.da...@gmail.com wrote: Models that I am looking for are mostly factorization based models (which includes both recommendation and topic modeling use-cases). For recommendation models, I need a combination of Spark SQL and ml model prediction api...I think spark job server is what I am looking for and it has fast http rest backend through spray which will scale fine through akka. Out of curiosity why netty? What model are you serving? Velox doesn't look like it is optimized for cases like ALS recs, if that's what you mean. I think scoring ALS at scale in real time takes a fairly different approach. The servlet engine probably doesn't matter at all in comparison. On Sat, Jun 20, 2015, 9:40 PM Debasish Das debasish.da...@gmail.com wrote: After getting used to Scala, writing Java is too much work :-) I am looking for scala based project that's using netty at its core (spray is one example). prediction.io is an option but that also looks quite complicated and not using all the ML features that got added in 1.3/1.4 Velox built on top of ML / Keystone ML pipeline API and that's useful but it is still using javax servlets which is not netty based. On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Oops, that link was for Oryx 1. Here's the repo for Oryx 2: https://github.com/OryxProject/oryx On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server that can serve models built with MLlib. -Sandy On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com wrote: Is velox NOT open source? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open source version of model server to build upon. Thanks. Deb -- - Charles
Re: Velox Model Server
Is there a presentation up about this end-to-end example? I'm looking into velox now - our internal model pipeline just saves factors to S3 and model server loads them periodically from S3 — Sent from Mailbox On Sat, Jun 20, 2015 at 9:46 PM, Debasish Das debasish.da...@gmail.com wrote: Integration of model server with ML pipeline API. On Sat, Jun 20, 2015 at 12:25 PM, Donald Szeto don...@prediction.io wrote: Mind if I ask what 1.3/1.4 ML features that you are looking for? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: After getting used to Scala, writing Java is too much work :-) I am looking for scala based project that's using netty at its core (spray is one example). prediction.io is an option but that also looks quite complicated and not using all the ML features that got added in 1.3/1.4 Velox built on top of ML / Keystone ML pipeline API and that's useful but it is still using javax servlets which is not netty based. On Sat, Jun 20, 2015 at 10:25 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Oops, that link was for Oryx 1. Here's the repo for Oryx 2: https://github.com/OryxProject/oryx On Sat, Jun 20, 2015 at 10:20 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Debasish, The Oryx project (https://github.com/cloudera/oryx), which is Apache 2 licensed, contains a model server that can serve models built with MLlib. -Sandy On Sat, Jun 20, 2015 at 8:00 AM, Charles Earl charles.ce...@gmail.com wrote: Is velox NOT open source? On Saturday, June 20, 2015, Debasish Das debasish.da...@gmail.com wrote: Hi, The demo of end-to-end ML pipeline including the model server component at Spark Summit was really cool. I was wondering if the Model Server component is based upon Velox or it uses a completely different architecture. https://github.com/amplab/velox-modelserver We are looking for an open source version of model server to build upon. Thanks. Deb -- - Charles -- Donald Szeto PredictionIO
RE: Matrix Multiplication and mllib.recommendation
One issue is that you broadcast the product vectors and then do a dot product one-by-one with the user vector. You should try forming a matrix of the item vectors and doing the dot product as a matrix-vector multiply which will make things a lot faster. Another optimisation that is avalailable on 1.4 is a recommendProducts method that blockifies the factors to make use of level 3 BLAS (ie matrix-matrix multiply). I am not sure if this is available in The Python api yet. But you can do a version yourself by using mapPartitions over user factors, blocking the factors into sub-matrices and doing matrix multiply with item factor matrix to get scores on a block-by-block basis. Also as Ilya says more parallelism can help. I don't think it's so necessary to do LSH with 30,000 items. — Sent from Mailbox On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote: Actually talk about this exact thing in a blog post here http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/. Keep in mind, you're actually doing a ton of math. Even with proper caching and use of broadcast variables this will take a while defending on the size of your cluster. To get real results you may want to look into locality sensitive hashing to limit your search space and definitely look into spinning up multiple threads to process your product features in parallel to increase resource utilization on the cluster. Thank you, Ilya Ganelin -Original Message- From: afarahat [ayman.fara...@yahoo.commailto:ayman.fara...@yahoo.com] Sent: Wednesday, June 17, 2015 11:16 PM Eastern Standard Time To: user@spark.apache.org Subject: Matrix Multiplication and mllib.recommendation Hello; I am trying to get predictions after running the ALS model. The model works fine. In the prediction/recommendation , I have about 30 ,000 products and 90 Millions users. When i try the predict all it fails. I have been trying to formulate the problem as a Matrix multiplication where I first get the product features, broadcast them and then do a dot product. Its still very slow. Any reason why here is a sample code def doMultiply(x): a = [] #multiply by mylen = len(pf.value) for i in range(mylen) : myprod = numpy.dot(x,pf.value[i][1]) a.append(myprod) return a myModel = MatrixFactorizationModel.load(sc, FlurryModelPath) #I need to select which products to broadcast but lets try all m1 = myModel.productFeatures().sample(False, 0.001) pf = sc.broadcast(m1.collect()) uf = myModel.userFeatures() f1 = uf.map(lambda x : (x[0], doMultiply(x[1]))) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-Multiplication-and-mllib-recommendation-tp23384.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. 
If you have received this communication in error, please contact the sender and delete the material from your computer.
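For what it's worth, a rough sketch of the blockified mapPartitions approach described above, using Breeze on the Scala side. Here `model` is assumed to be a trained MatrixFactorizationModel, and the block size and top-k are illustrative:
    import breeze.linalg.DenseMatrix

    // Broadcast the item-factor matrix once, then score blocks of user factors
    // against it with one matrix-matrix multiply (level-3 BLAS) per block.
    val itemFactors = model.productFeatures.collect()            // Array[(itemId, factors)]
    val itemIds = itemFactors.map(_._1)
    val itemMat = DenseMatrix(itemFactors.map(_._2): _*)         // numItems x rank
    val bcItems = sc.broadcast((itemIds, itemMat))
    val k = 10

    val recs = model.userFeatures.mapPartitions { users =>
      val (ids, items) = bcItems.value
      users.grouped(1024).flatMap { block =>
        val userMat = DenseMatrix(block.map(_._2): _*)           // blockSize x rank
        val scores = userMat * items.t                           // blockSize x numItems, one gemm call
        block.iterator.zipWithIndex.map { case ((userId, _), i) =>
          val row = scores(i, ::).t.toArray
          userId -> row.zipWithIndex.sortBy(-_._1).take(k).map { case (s, j) => (ids(j), s) }
        }
      }
    }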
Re: ALS predictALL not completing
So to be clear, you're trying to use the recommendProducts method of MatrixFactorizationModel? I don't see predictAll in 1.3.1 1.4.0 has a more efficient method to recommend products for all users (or vice versa): https://github.com/apache/spark/blob/v1.4.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L152 On Tue, Jun 16, 2015 at 4:30 PM, Ayman Farahat ayman.fara...@yahoo.com wrote: This is 1.3.1 Ayman Farahat -- View my research on my SSRN Author page: http://ssrn.com/author=1594571 -- *From:* Nick Pentreath nick.pentre...@gmail.com *To:* user@spark.apache.org user@spark.apache.org *Sent:* Tuesday, June 16, 2015 4:23 AM *Subject:* Re: ALS predictALL not completing Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat ayman.fara...@yahoo.com wrote: Hello; I have a data set of about 80 Million users and 12,000 items (very sparse ). I can get the training part working no problem. (model has 20 factors), However, when i try using Predict all for 80 Million x 10 items , the jib does not complete. When i use a smaller data set say 500k or a million it completes. Any ideas suggestions ? Thanks Ayman -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-predictALL-not-completing-tp23327.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
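For reference, a short sketch against the 1.4 API linked above (the model path is illustrative):
    import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

    val model = MatrixFactorizationModel.load(sc, "hdfs:///models/als")   // illustrative path
    // Single user, on demand:
    val topTenForUser42: Array[Rating] = model.recommendProducts(42, 10)
    // All users in one batch job (added in 1.4): RDD[(userId, Array[Rating])]
    val topTenForAll = model.recommendProductsForUsers(10)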
Re: ALS predictALL not completing
Which version of Spark are you using? On Tue, Jun 16, 2015 at 6:20 AM, afarahat ayman.fara...@yahoo.com wrote: Hello; I have a data set of about 80 Million users and 12,000 items (very sparse ). I can get the training part working no problem. (model has 20 factors), However, when i try using Predict all for 80 Million x 10 items , the jib does not complete. When i use a smaller data set say 500k or a million it completes. Any ideas suggestions ? Thanks Ayman -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ALS-predictALL-not-completing-tp23327.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark job workflow engine recommendations
I also tend to agree that Azkaban is somehqat easier to get set up. Though I haven't used the new UI for Oozie that is part of CDH, so perhaps that is another good option. It's a pity Azkaban is a little rough in terms of documenting its API, and the scalability is an issue. However it would be possible to have a few different instances running for different use cases / groups within the org perhaps — Sent from Mailbox On Wed, Aug 12, 2015 at 12:14 AM, Vikram Kone vikramk...@gmail.com wrote: Hi LarsThanks for the brain dump. All the points you made about target audience, degree of high availability and time based scheduling instead of event based scheduling are all valid and make sense.In our case, most of your Devs are .net based and so xml or web based scheduling is preferred over something written in Java/Scalia/Python. Based on my research so far on the available workflow managers today, azkaban is the most easier to adopt since it doesn't have any hard dependence on Hadoop and is easy to onboard and schedule jobs. I was able to install and execute some spark workflows in a day. Though the fact that it's being phased out in linkedin is troubling , I think it's the best suited for our use case today. Sent from Outlook On Sun, Aug 9, 2015 at 4:51 PM -0700, Lars Albertsson lars.alberts...@gmail.com wrote: I used to maintain Luigi at Spotify, and got some insight in workflow manager characteristics and production behaviour in the process. I am evaluating options for my current employer, and the short list is basically: Luigi, Azkaban, Pinball, Airflow, and rolling our own. The latter is not necessarily more work than adapting an existing tool, since existing managers are typically more or less tied to the technology used by the company that created them. Are your users primarily developers building pipelines that drive data-intensive products, or are they analysts, producing business intelligence? These groups tend to have preferences for different types of tools and interfaces. I have a love/hate relationship with Luigi, but given your requirements, it is probably the best fit: * It has support for Spark, and it seems to be used and maintained. * It has no builtin support for Cassandra, but Cassandra is heavily used at Spotify. IIRC, the code required to support Cassandra targets is more or less trivial. There is no obvious single definition of a dataset in C*, so you'll have to come up with a convention and encode it as a Target subclass. I guess that is why it never made it outside Spotify. * The open source community is active and it is well tested in production at multiple sites. * It is easy to write dependencies, but in a Python DSL. If your users are developers, this is preferable over XML or a web interface. There are always quirks and odd constraints somewhere that require the expressive power of a programming language. It also allows you to create extensions without changing Luigi itself. * It does not have recurring scheduling bulitin. Luigi needs a motor to get going, typically cron, installed on a few machines for redundancy. In a typical pipeline scenario, you give output datasets a time parameter, which arranges for a dataset to be produced each hour/day/week/month. * It supports failure notifications. Pinball and Airflow have similar architecture to Luigi, with a single central scheduler and workers that submit and execute jobs. 
They seem to be more solidly engineered at a glance, but less battle tested outside Pinterest/Airbnb, and they have fewer integrations to the data ecosystem. Azkaban has a different architecture and user interface, and seems more geared towards data scientists than developers; it has a good UI for controlling jobs, but writing extensions and controlling it programmatically seems more difficult than for Luigi. All of the tools above are centralised, and the central component can become a bottleneck and a single point of problem. I am not aware of any decentralised open source workflow managers, but you can run multiple instances and shard manually. Regarding recurring jobs, it is typically undesirable to blindly run jobs at a certain time. If you run jobs, e.g. with cron, and process whatever data is available in your input sources, your jobs become indeterministic and unreliable. If incoming data is late or missing, your jobs will fail or create artificial skews in output data, leading to confusing results. Moreover, if jobs fail or have bugs, it will be difficult to rerun them and get predictable results. This is why I don't think Chronos is a meaningful alternative for scheduling data processing. There are different strategies on this topic, but IMHO, it is easiest create predictable and reliable pipelines by bucketing incoming data into datasets that you seal off, and mark ready for processing, and then use the workflow manager's DAG logic to process data when input
Re: Is there any tool that i can prove to customer that spark is faster then hive ?
Perhaps you could time the end-to-end runtime for each pipeline, and each stage? Though I'd be fairly confident that Spark will outperform Hive/Mahout on MR, that's not the only consideration - having everything on a single platform and the Spark / DataFrame API is a huge win just by itself — Sent from Mailbox On Wed, Aug 12, 2015 at 1:45 PM, Ladle ladle.pa...@tcs.com wrote: Hi , I have build the the machine learning features and model using Apache spark. And the same features i have i build using hive,java and used mahout to run model. Now how can i show to customer that Apache Spark is more faster then hive. Is there any tool that shows the time ? Regards, Ladle -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-any-tool-that-i-can-prove-to-customer-that-spark-is-faster-then-hive-tp24224.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
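There's no dedicated tool needed - even a crude timing wrapper like the sketch below, applied around the action that materialises each stage of both pipelines, gives comparable wall-clock numbers. The feature RDD here is a stand-in for the real pipeline:
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }

    val featureRDD = sc.parallelize(1 to 1000000).map(_ * 2)   // stand-in for the real feature pipeline
    val n = timed("feature build (Spark)") { featureRDD.count() }  // count() forces the job to run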
Re: RDD[Future[T]] = Future[RDD[T]]
In this case, each partition will block until the futures in that partition are completed. If you are in the end collecting all the Futures to the driver, what is the reasoning behind using an RDD? You could just use a bunch of Futures directly. If you want to do some processing on the results of the futures, then I'd say you would need to block in each partition until the Futures' results are completed, as I'm not at all sure whether Futures would be composable across stage / task boundaries. On Mon, Jul 27, 2015 at 9:33 AM, Ayoub benali.ayoub.i...@gmail.com wrote: do you mean something like this ? val values = rdd.mapPartitions{ i: Iterator[Future[T]] = val future: Future[Iterator[T]] = Future sequence i Await result (future, someTimeout) } Where is the blocking happening in this case? It seems to me that all the workers will be blocked until the future is completed, no ? 2015-07-27 7:24 GMT+02:00 Nick Pentreath [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=0: You could use Iterator.single on the future[iterator]. However if you collect all the partitions I'm not sure if it will work across executor boundaries. Perhaps you may need to await the sequence of futures in each partition and return the resulting iterator. — Sent from Mailbox https://www.dropbox.com/mailbox On Sun, Jul 26, 2015 at 10:43 PM, Ayoub Benali [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=1 wrote: It doesn't work because mapPartitions expects a function f:(Iterator[T]) ⇒ Iterator[U] while .sequence wraps the iterator in a Future 2015-07-26 22:25 GMT+02:00 Ignacio Blasco [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=2: Maybe using mapPartitions and .sequence inside it? El 26/7/2015 10:22 p. m., Ayoub [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=3 escribió: Hello, I am trying to convert the result I get after doing some async IO : val rdd: RDD[T] = // some rdd val result: RDD[Future[T]] = rdd.map(httpCall) Is there a way collect all futures once they are completed in a *non blocking* (i.e. without scala.concurrent Await) and lazy way? If the RDD was a standard scala collection then calling scala.concurrent.Future.sequence would have resolved the issue but RDD is not a TraversableOnce (which is required by the method). Is there a way to do this kind of transformation with an RDD[Future[T]] ? Thanks, Ayoub. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Future-T-Future-RDD-T-tp24000.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=4 For additional commands, e-mail: [hidden email] http:///user/SendEmail.jtp?type=nodenode=24005i=5 -- View this message in context: Re: RDD[Future[T]] = Future[RDD[T]] http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-Future-T-Future-RDD-T-tp24005.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.
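A sketch of that per-partition blocking variant. The async call and timeout are placeholders standing in for the httpCall from the original question; each task blocks only on its own partition's futures, so no Future crosses a task boundary:
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    def httpCall(x: Int): Future[Int] = Future { x * 2 }   // stand-in for the real async call
    val rdd = sc.parallelize(1 to 100)

    val results = rdd.mapPartitions { iter =>
      val futures = iter.map(httpCall).toVector            // kick off this partition's async calls
      val all = Future.sequence(futures)                   // Future[Vector[Int]]
      Await.result(all, 10.minutes).iterator               // block inside the task, return an Iterator
    }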
Re: Velox Model Server
Honestly I don't believe this kind of functionality belongs within spark-jobserver. For serving of factor-type models, you are typically in the realm of recommendations or ad-serving scenarios - i.e. needing to score a user / context against many possible items and return a top-k list of those. In addition, filtering and search comes into play heavily - e.g. filter recommendations by item category, by geo-location, by stock-level status, by price / profit levels, by promoted / blocked content, etc etc. And the requirements are typically real-time (i.e. a few hundred ms at the most). So I think there are too many specialist requirements vs spark-jobserver. In terms of general approach, your options are to: (a) score first to get a list recs, and then filter / re-rank / apply queries to further winnow that down. This typically means returning the top L K recs from the scoring, so that you have enough left after filtering. (b) score and filter in the same step (or at least using the same engine). Scoring is really the easiest part - modulo dealing with massive item-sets which can be dealt with by (i) LSH / approx. nearest neighbour approaches and/or (ii) partitioning / parallelization approaches. We currently use approach (a) but are looking into (b) also. I think the best idea is to pick one of the existing frameworks (whether Oryx, Velox, PredictionIO, SeldonIO etc) that best suits your requirements, and build around that. Or build something new of your own if you want to use Akka. I went with Scalatra because it is what our API layer is built with, and I find the routing DSL much nicer vs Spray. Both have good Akka integration. I don't think the front-end matters that much (Finatra is another option). If you want Akka then maybe Spray / Akka HTTP is the best way to go. Our current model server as I mentioned is very basic and a bit hacked together, but it may have some useful ideas or serve as a starting point if there is interest. On Wed, Jun 24, 2015 at 5:46 PM, Debasish Das debasish.da...@gmail.com wrote: Thanks Nick, Sean for the great suggestions... Since you guys have already hit these issues before I think it will be great if we can add the learning to Spark Job Server and enhance it for community. Nick, do you see any major issues in using Spray over Scalatra ? Looks like Model Server API layer needs access to a performant KV store (Redis/Memcached), Elastisearch (we used Solr before for item-item serving but I liked the Spark-Elastisearch integration, REST is Netty based unlike Solr's Jetty and YARN client looks more stable and so it is worthwhile to see if it improves over Solr based serving) and ML Models (which are moving towards Spark SQL style in 1.3/1.4 with the introduction of Pipeline API) An initial version of KV store might be simple LRU cache. For KV store are there any comparisons available with IndexedRDD and Redis/Memcached ? Velox is using CoreOS EtcdClient (which is Go based) but I am not sure if it is used as a full fledged distributed cache or not. May be it is being used as zookeeper alternative. On Wed, Jun 24, 2015 at 2:02 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Ok My view is with only 100k items, you are better off serving in-memory for items vectors. i.e. store all item vectors in memory, and compute user * item score on-demand. In most applications only a small proportion of users are active, so really you don't need all 10m user vectors in memory. They could be looked up from a K-V store and have an LRU cache in memory for say 1m of those. 
Optionally also update them as feedback comes in. As far as I can see, this is pretty much what velox does except it partitions all user vectors across nodes to scale. Oryx does almost the same but Oryx1 kept all user and item vectors in memory (though I am not sure about whether Oryx2 still stores all user and item vectors in memory or partitions in some way). Deb, we are using a custom Akka-based model server (with Scalatra frontend). It is more focused on many small models in-memory (largest of these is around 5m user vectors, 100k item vectors, with factor size 20-50). We use Akka cluster sharding to allow scale-out across nodes if required. We have a few hundred models comfortably powered by m3.xlarge AWS instances. Using floats you could probably have all of your factors in memory on one 64GB machine (depending on how many models you have). Our solution is not that generic and a little hacked-together - but I'd be happy to chat offline about sharing what we've done. I think it still has a basic client to the Spark JobServer which would allow triggering re-computation jobs periodically. We currently just run batch re-computation and reload factors from S3 periodically. We then use Elasticsearch to post-filter results and blend content-based stuff - which I think might be more efficient than SparkSQL for this particular purpose. On Wed, Jun 24
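As a toy sketch of approach (a) above - over-fetch from the scorer, then post-filter down to the final k. The over-fetch factor and the allowed predicate (category, stock, geo, etc.) are placeholders, and `model` is assumed to be an ALS-style MatrixFactorizationModel:
    import org.apache.spark.mllib.recommendation.Rating

    def recommendFiltered(userId: Int, k: Int, allowed: Int => Boolean): Seq[Rating] = {
      val overFetch = k * 5                                // return "top L > k" so enough survives filtering
      model.recommendProducts(userId, overFetch)
        .filter(r => allowed(r.product))
        .take(k)
        .toSeq
    }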
Re: thought experiment: use spark ML to real time prediction
Yup, currently PMML export, or Java serialization, are the options realistically available. Though PMML may deter some, there are not many viable cross-platform alternatives (with nearly as much coverage). On Thu, Nov 12, 2015 at 1:42 PM, Sean Owenwrote: > This is all starting to sound a lot like what's already implemented in > Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm > not clear it helps a lot to reimplement this in Spark. > > On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung > wrote: > >> +1 on that. It would be useful to use the model outside of Spark. >> >> >> _ >> From: DB Tsai >> Sent: Wednesday, November 11, 2015 11:57 PM >> Subject: Re: thought experiment: use spark ML to real time prediction >> To: Nirmal Fernando >> Cc: Andy Davidson , Adrian Tanase < >> atan...@adobe.com>, user @spark >> >> >> >> Do you think it will be useful to separate those models and model >> loader/writer code into another spark-ml-common jar without any spark >> platform dependencies so users can load the models trained by Spark ML in >> their application and run the prediction? >> >> >> Sincerely, >> >> DB Tsai >> -- >> Web: https://www.dbtsai.com >> PGP Key ID: 0xAF08DF8D >> >> On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando >> wrote: >> >>> As of now, we are basically serializing the ML model and then >>> deserialize it for prediction at real time. >>> >>> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase >>> wrote: >>> I don’t think this answers your question but here’s how you would evaluate the model in realtime in a streaming app https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html Maybe you can find a way to extract portions of MLLib and run them outside of spark – loading the precomputed model and calling .predict on it… -adrian From: Andy Davidson Date: Tuesday, November 10, 2015 at 11:31 PM To: "user @spark" Subject: thought experiment: use spark ML to real time prediction Lets say I have use spark ML to train a linear model. I know I can save and load the model to disk. I am not sure how I can use the model in a real time environment. For example I do not think I can return a “prediction” to the client using spark streaming easily. Also for some applications the extra latency created by the batch process might not be acceptable. If I was not using spark I would re-implement the model I trained in my batch environment in a lang like Java and implement a rest service that uses the model to create a prediction and return the prediction to the client. Many models make predictions using linear algebra. Implementing predictions is relatively easy if you have a good vectorized LA package. Is there a way to use a model I trained using spark ML outside of spark? As a motivating example, even if its possible to return data to the client using spark streaming. I think the mini batch latency would not be acceptable for a high frequency stock trading system. Kind regards Andy P.s. The examples I have seen so far use spark streaming to “preprocess” predictions. For example a recommender system might use what current users are watching to calculate “trending recommendations”. These are stored on disk and served up to users when the use the “movie guide”. If a recommendation was a couple of min. old it would not effect the end users experience. >>> >>> >>> -- >>> >>> Thanks & regards, >>> Nirmal >>> >>> Team Lead - WSO2 Machine Learner >>> Associate Technical Lead - Data Technologies Team, WSO2 Inc. 
>>> Mobile: +94715779733 >>> Blog: http://nirmalfdo.blogspot.com/ >>> >>> >>> >> >> >> >
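To illustrate the two options (hedged: toPMML only exists for models that mix in PMMLExportable, such as k-means and the linear models, and the output paths are made up):
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
    val clusters = KMeans.train(points, 2, 10)

    // Option 1: PMML export, consumable by JPMML / OpenScoring etc.
    clusters.toPMML("/tmp/kmeans.pmml")

    // Option 2: plain Java serialization, deserialized later inside the serving app.
    val out = new java.io.ObjectOutputStream(new java.io.FileOutputStream("/tmp/kmeans.bin"))
    out.writeObject(clusters)
    out.close()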
Re: DynamoDB Connector?
See this thread for some info: http://apache-spark-user-list.1001560.n3.nabble.com/DynamoDB-input-source-td8814.html I don't think the situation has changed that much - if you're using Spark on EMR, then I think the InputFormat is available in a JAR (though I haven't tested that). Otherwise you'll need to try to get the JAR and see if you can get it to work outside of EMR. I'm afraid this thread ( https://forums.aws.amazon.com/thread.jspa?threadID=168506) does not appear encouraging, even for using Spark on EMR to read from DynamoDB using the InputFormat! It's a pity AWS doesn't open source the InputFormat. On Mon, Nov 16, 2015 at 5:00 AM, Charles Cobbwrote: > Hi, > > What is the best practice for reading from DynamoDB from Spark? I know I > can use the Java API, but this doesn't seem to take data locality into > consideration at all. > > I was looking for something along the lines of the cassandra connector: > https://github.com/datastax/spark-cassandra-connector > > Thanks, > CJ > >
Re: Machine learning with spark (book code example error)
Hi there. I'm the author of the book (thanks for buying it by the way :) Ideally if you're having any trouble with the book or code, it's best to contact the publisher and submit a query ( https://www.packtpub.com/books/content/support/17400) However, I can help with this issue. The problem is that the "testLabels" code needs to be indented over multiple lines: val testPath = "/PATH/20news-bydate-test/*" val testRDD = sc.wholeTextFiles(testPath) val testLabels = testRDD.map { case (file, text) => val topic = file.split("/").takeRight(2).head newsgroupsMap(topic) } As it is in the sample code attached. If you copy the whole indented block (or line by line) into the console, it should work - I've tested all the sample code again and indeed it works for me. Hope this helps Nick On Tue, Oct 13, 2015 at 8:31 PM, Zsombor Egyedwrote: > Hi! > > I was reading the ML with spark book, and I was very interested about the > 9. chapter (text mining), so I tried code examples. > > Everything was fine, but in this line: > > val testLabels = testRDD.map { > > case (file, text) => val topic = file.split("/").takeRight(2).head > > newsgroupsMap(topic) } > > I got an error: "value newsgroupsMap is not a member of String" > > Other relevant part of the code: > val path = "/PATH/20news-bydate-train/*" > val rdd = sc.wholeTextFiles(path) > val newsgroups = rdd.map { case (file, text) => > file.split("/").takeRight(2).head } > > val tf = hashingTF.transform(tokens) > val idf = new IDF().fit(tf) > val tfidf = idf.transform(tf) > > val newsgroupsMap = newsgroups.distinct.collect().zipWithIndex.toMap > val zipped = newsgroups.zip(tfidf) > val train = zipped.map { case (topic, vector) > =>LabeledPoint(newsgroupsMap(topic), vector) } > train.cache > > val model = NaiveBayes.train(train, lambda = 0.1) > > val testPath = "/PATH//20news-bydate-test/*" > val testRDD = sc.wholeTextFiles(testPath) > val testLabels = testRDD.map { case (file, text) => val topic = > file.split("/").takeRight(2).head newsgroupsMap(topic) } > > I attached the whole program code. > Can anyone help, what the problem is? > > Regards, > Zsombor > > > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org >
Re: How to specify the numFeatures in HashingTF
Setting the numFeatures higher than the vocabulary size will tend to reduce the chance of hash collisions, but it's not strictly necessary - it becomes a memory / accuracy trade-off. Surprisingly, the impact on model performance of moderate hash collisions is often not significant. So it may be worth trying a few settings out (lower than the vocabulary size, higher, etc.) and seeing what the impact is on evaluation metrics. — Sent from Mailbox On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li wrote: > Hi, > There is a parameter in the HashingTF called "numFeatures". I was wondering > what is the best way to set the value to this parameter. In the use case of > text categorization, do you need to know in advance the number of words in > your vocabulary? or do you set it to be a large value, greater than the > number of words in your vocabulary? > Thanks, > Jianguo
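For example (the sizes are illustrative - powers of two are conventional - and you'd compare downstream evaluation metrics for each):
    import org.apache.spark.mllib.feature.HashingTF

    val docs = sc.parallelize(Seq(
      Seq("spark", "mllib", "hashing", "tf"),
      Seq("spark", "streaming", "kafka")
    ))
    val smallTf = new HashingTF(numFeatures = 1 << 16).transform(docs) // more collisions, less memory
    val largeTf = new HashingTF(numFeatures = 1 << 20).transform(docs) // fewer collisions, more memory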
Re: Spark job workflow engine recommendations
We're also using Azkaban for scheduling, and we simply use spark-submit via shell scripts. It works fine. The auto retry feature with a large number of retries (like 100 or 1000 perhaps) should take care of long-running jobs with restarts on failure. We haven't used it for streaming yet though we have long-running jobs and Azkaban won't kill them unless an SLA is in place. — Sent from Mailbox On Wed, Oct 7, 2015 at 7:18 PM, Vikram Kone <vikramk...@gmail.com> wrote: > Hien, > I saw this pull request and from what I understand this is geared towards > running spark jobs over hadoop. We are using spark over cassandra and not > sure if this new jobtype supports that. I haven't seen any documentation in > regards to how to use this spark job plugin, so that I can test it out on > our cluster. > We are currently submitting our spark jobs using command job type using the > following command "dse spark-submit --class com.org.classname ./test.jar" > etc. What would be the advantage of using the native spark job type over > command job type? > I didn't understand from your reply if azkaban already supports long > running jobs like spark streaming..does it? streaming jobs generally need > to be running indefinitely or forever and needs to be restarted if for some > reason they fail (lack of resources may be..). I can probably use the auto > retry feature for this, but not sure > I'm looking forward to the multiple executor support which will greatly > enhance the scalability issue. > On Wed, Oct 7, 2015 at 9:56 AM, Hien Luu <h...@linkedin.com> wrote: >> The spark job type was added recently - see this pull request >> https://github.com/azkaban/azkaban-plugins/pull/195. You can leverage >> the SLA feature to kill a job if it ran longer than expected. >> >> BTW, we just solved the scalability issue by supporting multiple >> executors. Within a week or two, the code for that should be merged in the >> main trunk. >> >> Hien >> >> On Tue, Oct 6, 2015 at 9:40 PM, Vikram Kone <vikramk...@gmail.com> wrote: >> >>> Does Azkaban support scheduling long running jobs like spark steaming >>> jobs? Will Azkaban kill a job if it's running for a long time. >>> >>> >>> On Friday, August 7, 2015, Vikram Kone <vikramk...@gmail.com> wrote: >>> >>>> Hien, >>>> Is Azkaban being phased out at linkedin as rumored? If so, what's >>>> linkedin going to use for workflow scheduling? Is there something else >>>> that's going to replace Azkaban? >>>> >>>> On Fri, Aug 7, 2015 at 11:25 AM, Ted Yu <yuzhih...@gmail.com> wrote: >>>> >>>>> In my opinion, choosing some particular project among its peers should >>>>> leave enough room for future growth (which may come faster than you >>>>> initially think). >>>>> >>>>> Cheers >>>>> >>>>> On Fri, Aug 7, 2015 at 11:23 AM, Hien Luu <h...@linkedin.com> wrote: >>>>> >>>>>> Scalability is a known issue due the the current architecture. >>>>>> However this will be applicable if you run more 20K jobs per day. >>>>>> >>>>>> On Fri, Aug 7, 2015 at 10:30 AM, Ted Yu <yuzhih...@gmail.com> wrote: >>>>>> >>>>>>> From what I heard (an ex-coworker who is Oozie committer), Azkaban >>>>>>> is being phased out at LinkedIn because of scalability issues (though >>>>>>> UI-wise, Azkaban seems better). >>>>>>> >>>>>>> Vikram: >>>>>>> I suggest you do more research in related projects (maybe using their >>>>>>> mailing lists). >>>>>>> >>>>>>> Disclaimer: I don't work for LinkedIn. 
>>>>>>> >>>>>>> On Fri, Aug 7, 2015 at 10:12 AM, Nick Pentreath < >>>>>>> nick.pentre...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi Vikram, >>>>>>>> >>>>>>>> We use Azkaban (2.5.0) in our production workflow scheduling. We >>>>>>>> just use local mode deployment and it is fairly easy to set up. It is >>>>>>>> pretty easy to use and has a nice scheduling and logging interface, as >>>>>>>> well >>>>>>>> as SLAs (like kill job and notify if it doesn't complete in 3 hours or >>>>>>>> whatever). >>>>>>>> >>
Re: thought experiment: use spark ML to real time prediction
I think the issue with pulling in all of spark-core is often with dependencies (and versions) conflicting with the web framework (or Akka in many cases). Plus it really is quite heavy if you just want a fairly lightweight model-serving app. For example we've built a fairly simple but scalable ALS factor model server on Scalatra, Akka and Breeze. So all you really need is the web framework and Breeze (or an alternative linear algebra lib). I definitely hear the pain-point that PMML might not be able to handle some types of transformations or models that exist in Spark. However, here's an example from scikit-learn -> PMML that may be instructive ( https://github.com/scikit-learn/scikit-learn/issues/1596 and https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list of estimators and transformers are supported (including e.g. scaling and encoding, and PCA). I definitely think the current model I/O and "export" or "deploy to production" situation needs to be improved substantially. However, you are left with the following options: (a) build out a lightweight "spark-ml-common" project that brings in the dependencies needed for production scoring / transformation in independent apps. However, here you only support Scala/Java - what about R and Python? Also, what about the distributed models? Perhaps "local" wrappers can be created, though this may not work for very large factor or LDA models. See also H20 example http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html (b) build out Spark's PMML support, and add missing stuff to PMML where possible. The benefit here is an existing standard with various tools for scoring (via REST server, Java app, Pig, Hive, various language support). (c) build out a more comprehensive I/O, serialization and scoring framework. Here you face the issue of supporting various predictors and transformers generically, across platforms and versioning. i.e. you're re-creating a new standard like PMML Option (a) is do-able, but I'm a bit concerned that it may be too "Spark specific", or even too "Scala / Java" specific. But it is still potentially very useful to Spark users to build this out and have a somewhat standard production serving framework and/or library (there are obviously existing options like PredictionIO etc). Option (b) is really building out the existing PMML support within Spark, so a lot of the initial work has already been done. I know some folks had (or have) licensing issues with some components of JPMML (e.g. the evaluator and REST server). But perhaps the solution here is to build an Apache2-licensed evaluator framework. Option (c) is obviously interesting - "let's build a better PMML (that uses JSON or whatever instead of XML!)". But it also seems like a huge amount of reinventing the wheel, and like any new standard would take time to garner wide support (if at all). It would be really useful to start to understand what the main missing pieces are in PMML - perhaps the lowest-hanging fruit is simply to contribute improvements or additions to PMML. On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > That may not be an issue if the app using the models runs by itself (not > bundled into an existing app), which may actually be the right way to > design it considering separation of concerns. > > Regards > Sab > > On Fri, Nov 13, 2015 at 9:59 AM, DB Tsaiwrote: > >> This will bring the whole dependencies of spark will may break the web >> app. 
>> >> >> Sincerely, >> >> DB Tsai >> -- >> Web: https://www.dbtsai.com >> PGP Key ID: 0xAF08DF8D >> >> On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando wrote: >> >>> >>> >>> On Fri, Nov 13, 2015 at 2:04 AM, darren wrote: >>> I agree 100%. Making the model requires large data and many cpus. Using it does not. This is a very useful side effect of ML models. If mlib can't use models outside spark that's a real shame. >>> >>> Well you can as mentioned earlier. You don't need Spark runtime for >>> predictions, save the serialized model and deserialize to use. (you need >>> the Spark Jars in the classpath though) >>> Sent from my Verizon Wireless 4G LTE smartphone Original message From: "Kothuvatiparambil, Viju" < viju.kothuvatiparam...@bankofamerica.com> Date: 11/12/2015 3:09 PM (GMT-05:00) To: DB Tsai , Sean Owen Cc: Felix Cheung , Nirmal Fernando < nir...@wso2.com>, Andy Davidson , Adrian Tanase , "user @spark" , Xiangrui Meng , hol...@pigscanfly.ca Subject: RE: thought experiment: use spark ML to real time prediction I am glad to see DB’s comments,
Re: Spark works with the data in another cluster(Elasticsearch)
While it's true that locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses ES's scan & scroll to pull data pretty efficiently, so pulling it across the network is not too bad. If you do need to avoid that, pull the data and write what you need to HDFS as, say, Parquet files (e.g. pull data daily and write it, then you have all data available on your Spark cluster). And of course ensure that when you do pull data from ES to Spark, you cache it to avoid hitting the network again. — Sent from Mailbox On Tue, Aug 25, 2015 at 12:01 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If the data is local to the machine then obviously it will be faster compared to pulling it through the network and storing it locally (either memory or disk etc). Have a look at the data locality http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/data_locality.html . Thanks Best Regards On Tue, Aug 18, 2015 at 8:09 PM, gen tang gen.tan...@gmail.com wrote: Hi, Currently, I have my data in the cluster of Elasticsearch and I try to use spark to analyse those data. The cluster of Elasticsearch and the cluster of spark are two different clusters. And I use hadoop input format(es-hadoop) to read data in ES. I am wondering how this environment affect the speed of analysis. If I understand well, spark will read data from ES cluster and do calculate on its own cluster(include writing shuffle result on its own machine), Is this right? If this is correct, I think that the performance will just a little bit slower than the data stored on the same cluster. I will be appreciated if someone can share his/her experience about using spark with elasticsearch. Thanks a lot in advance for your help. Cheers Gen
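A rough sketch of the "pull once, keep it on the Spark side" pattern - the index name, snapshot path and use of the es-hadoop DataFrame source are illustrative, and the es-hadoop connection settings are elided:
    import org.elasticsearch.spark._

    val events = sc.esRDD("events/docs")        // pulled once via scan & scroll
    events.cache()                              // avoid hitting the ES cluster again within this job

    // Or snapshot a day's worth of data to HDFS as Parquet for repeated analysis
    // (assumes an existing SQLContext named sqlContext, as in the shell):
    val eventsDF = sqlContext.read.format("org.elasticsearch.spark.sql").load("events/docs")
    eventsDF.write.parquet("hdfs:///snapshots/events/2015-08-25")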
Re: Spark ANN
Haven't checked the actual code but that doc says "MLPC employes backpropagation for learning the model. .."? — Sent from Mailbox On Mon, Sep 7, 2015 at 8:18 PM, Ruslan Dautkhanovwrote: > http://people.apache.org/~pwendell/spark-releases/latest/ml-ann.html > Implementation seems missing backpropagation? > Was there is a good reason to omit BP? > What are the drawbacks of a pure feedforward-only ANN? > Thanks! > -- > Ruslan Dautkhanov
Re: What is the best way to migrate existing scikit-learn code to PySpark?
You might want to check out https://github.com/lensacom/sparkit-learn Though it's true for random Forests / trees you will need to use MLlib — Sent from Mailbox On Sat, Sep 12, 2015 at 9:00 PM, Jörn Frankewrote: > I fear you have to do the plumbing all yourself. This is the same for all > commercial and non-commercial libraries/analytics packages. It often also > depends on the functional requirements on how you distribute. > Le sam. 12 sept. 2015 à 20:18, Rex X a écrit : >> Hi everyone, >> >> What is the best way to migrate existing scikit-learn code to PySpark >> cluster? Then we can bring together the full power of both scikit-learn and >> spark, to do scalable machine learning. (I know we have MLlib. But the >> existing code base is big, and some functions are not fully supported yet.) >> >> Currently I use multiprocessing module of Python to boost the speed. But >> this only works for one node, while the data set is small. >> >> For many real cases, we may need to deal with gigabytes or even terabytes >> of data, with thousands of raw categorical attributes, which can lead to >> millions of discrete features, using 1-of-k representation. >> >> For these cases, one solution is to use distributed memory. That's why I >> am considering spark. And spark support Python! >> With Pyspark, we can import scikit-learn. >> >> But the question is how to make the scikit-learn code, decisionTree >> classifier for example, running in distributed computing mode, to benefit >> the power of Spark? >> >> >> Best, >> Rex >>
Re: What is the best way to migrate existing scikit-learn code to PySpark?
I should point out that I'm not sure what the performance of that project is. I'd expect that native data frame in PySpark will be significantly more efficient than their DictRDD. It would be interesting to see a performance comparison for the pipelines relative to native Spark ML pipelines, if you do test both out. — Sent from Mailbox On Sat, Sep 12, 2015 at 10:52 PM, Rex X <dnsr...@gmail.com> wrote: > Jorn and Nick, > Thanks for answering. > Nick, the sparkit-learn project looks interesting. Thanks for mentioning it. > Rex > On Sat, Sep 12, 2015 at 12:05 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: >> You might want to check out https://github.com/lensacom/sparkit-learn >> <https://github.com/lensacom/sparkit-learn/blob/master/README.rst> >> >> Though it's true for random >> Forests / trees you will need to use MLlib >> >> — >> Sent from Mailbox <https://www.dropbox.com/mailbox> >> >> >> On Sat, Sep 12, 2015 at 9:00 PM, Jörn Franke <jornfra...@gmail.com> wrote: >> >>> I fear you have to do the plumbing all yourself. This is the same for all >>> commercial and non-commercial libraries/analytics packages. It often also >>> depends on the functional requirements on how you distribute. >>> >>> Le sam. 12 sept. 2015 à 20:18, Rex X <dnsr...@gmail.com> a écrit : >>> >>>> Hi everyone, >>>> >>>> What is the best way to migrate existing scikit-learn code to PySpark >>>> cluster? Then we can bring together the full power of both scikit-learn and >>>> spark, to do scalable machine learning. (I know we have MLlib. But the >>>> existing code base is big, and some functions are not fully supported yet.) >>>> >>>> Currently I use multiprocessing module of Python to boost the speed. But >>>> this only works for one node, while the data set is small. >>>> >>>> For many real cases, we may need to deal with gigabytes or even >>>> terabytes of data, with thousands of raw categorical attributes, which can >>>> lead to millions of discrete features, using 1-of-k representation. >>>> >>>> For these cases, one solution is to use distributed memory. That's why I >>>> am considering spark. And spark support Python! >>>> With Pyspark, we can import scikit-learn. >>>> >>>> But the question is how to make the scikit-learn code, decisionTree >>>> classifier for example, running in distributed computing mode, to benefit >>>> the power of Spark? >>>> >>>> >>>> Best, >>>> Rex >>>> >>> >>