date:20141030

Thanks Jimmy.
I will have a try.
Thanks very much for your guys' help.

Best,
Peng

On Thu, Oct 30, 2014 at 8:19 PM, Jimmy  wrote:

> sampleRDD. cache()
>
> Sent from my iPhone
>
> On Oct 30, 2014, at 5:01 PM, peng xia  wrote:
>
> Hi Xiangrui,
>
> Can you give me some code example about caching, as I am new to Spark.
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng  wrote:
>
>> Then caching should solve the problem. Otherwise, it is just loading
>> and parsing data from disk for each iteration. -Xiangrui
>>
>> On Thu, Oct 30, 2014 at 11:44 AM, peng xia  wrote:
>> > Thanks for all your help.
>> > I think I didn't cache the data. My previous cluster was expired and I
>> don't
>> > have a chance to check the load balance or app manager.
>> > Below is my code.
>> > There are 18 features for each record and I am using the Scala API.
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.SparkContext
>> > import org.apache.spark.SparkContext._
>> > import org.apache.spark.rdd._
>> > import org.apache.spark.mllib.classification.SVMWithSGD
>> > import org.apache.spark.mllib.regression.LabeledPoint
>> > import org.apache.spark.mllib.linalg.Vectors
>> > import java.util.Calendar
>> >
>> > object BenchmarkClassification {
>> > def main(args: Array[String]) {
>> > // Load and parse the data file
>> > val conf = new SparkConf()
>> >  .setAppName("SVM")
>> >  .set("spark.executor.memory", "8g")
>> >  // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>> >val sc = new SparkContext(conf)
>> > val data = sc.textFile(args(0))
>> > val parsedData = data.map { line =>
>> >  val parts = line.split(',')
>> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
>> > x.toDouble)))
>> > }
>> > val testData = sc.textFile(args(1))
>> > val testParsedData = testData .map { line =>
>> >  val parts = line.split(',')
>> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
>> > x.toDouble)))
>> > }
>> >
>> > // Run training algorithm to build the model
>> > val numIterations = 20
>> > val model = SVMWithSGD.train(parsedData, numIterations)
>> >
>> > // Evaluate model on training examples and compute training error
>> > // val labelAndPreds = testParsedData.map { point =>
>> > //   val prediction = model.predict(point.features)
>> > //   (point.label, prediction)
>> > // }
>> > // val trainErr = labelAndPreds.filter(r => r._1 !=
>> r._2).count.toDouble /
>> > testParsedData.count
>> > // println("Training Error = " + trainErr)
>> > println(Calendar.getInstance().getTime())
>> > }
>> > }
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Best,
>> > Peng
>> >
>> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng 
>> wrote:
>> >>
>> >> DId you cache the data and check the load balancing? How many
>> >> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>> >>
>> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
>> >> > Watch the app manager it should tell you what's running and taking
>> >> > awhile...
>> >> > My guess it's a "distinct" function on the data.
>> >> > J
>> >> >
>> >> > Sent from my iPhone
>> >> >
>> >> > On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> >
>> >> >
>> >> > Previous we have applied SVM algorithm in MLlib to 5 million records
>> >> > (600
>> >> > mb), it takes more than 25 minutes to finish.
>> >> > The spark version we are using is 1.0 and we were running this
>> program
>> >> > on a
>> >> > 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
>> >> >
>> >> > The 5 million records only have two distinct records (One positive
>> and
>> >> > one
>> >> > negative), others are all duplications.
>> >> >
>> >> > Any one has any idea on why it takes so long on this small data?
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Best,
>> >> >
>> >> > Peng
>> >
>> >
>>
>
>

Confused about class paths in spark 1.1.0

2014-10-30 Thread Shay Seng

Hi,

I've been trying to move up from spark 0.9.2 to 1.1.0.
I'm getting a little confused with the setup for a few different use cases,
grateful for any pointers...

(1) spark-shell + with jars that are only required by the driver
(1a)
I added "spark.driver.extraClassPath  /mypath/to.jar" to my
spark-defaults.conf
I launched spark-shell with:  ./spark-shell

Here I see on the WebUI that spark.driver.extraClassPath has been set, but
I am NOT able to access any methods in the jar.

(1b)
I removed "spark.driver.extraClassPath" from my spark-default.conf
I launched spark-shell with  .//spark-shell --driver.class.path
/mypath/to.jar

Again I see that the WebUI spark.driver.extraClassPath has been set.
But this time I am able to access the methods in the jar.

Q: Is spark-shell not considered the driver in this case?  why does using
--driver.class.path on the command line have a different behavior to
setting it in spark-defaults.conf ?


(2) Rather than adding each jar individually, is there a way to use
wildcards? Previously with SPARK_CLASS_PATH I was able to use /*
 but with --driver.class.path it seems to require individual files.

tks
Shay

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy

sampleRDD. cache()

Sent from my iPhone

> On Oct 30, 2014, at 5:01 PM, peng xia  wrote:
> 
> Hi Xiangrui, 
> 
> Can you give me some code example about caching, as I am new to Spark.
> 
> Thanks,
> Best,
> Peng
> 
>> On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng  wrote:
>> Then caching should solve the problem. Otherwise, it is just loading
>> and parsing data from disk for each iteration. -Xiangrui
>> 
>> On Thu, Oct 30, 2014 at 11:44 AM, peng xia  wrote:
>> > Thanks for all your help.
>> > I think I didn't cache the data. My previous cluster was expired and I 
>> > don't
>> > have a chance to check the load balance or app manager.
>> > Below is my code.
>> > There are 18 features for each record and I am using the Scala API.
>> >
>> > import org.apache.spark.SparkConf
>> > import org.apache.spark.SparkContext
>> > import org.apache.spark.SparkContext._
>> > import org.apache.spark.rdd._
>> > import org.apache.spark.mllib.classification.SVMWithSGD
>> > import org.apache.spark.mllib.regression.LabeledPoint
>> > import org.apache.spark.mllib.linalg.Vectors
>> > import java.util.Calendar
>> >
>> > object BenchmarkClassification {
>> > def main(args: Array[String]) {
>> > // Load and parse the data file
>> > val conf = new SparkConf()
>> >  .setAppName("SVM")
>> >  .set("spark.executor.memory", "8g")
>> >  // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>> >val sc = new SparkContext(conf)
>> > val data = sc.textFile(args(0))
>> > val parsedData = data.map { line =>
>> >  val parts = line.split(',')
>> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
>> > x.toDouble)))
>> > }
>> > val testData = sc.textFile(args(1))
>> > val testParsedData = testData .map { line =>
>> >  val parts = line.split(',')
>> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
>> > x.toDouble)))
>> > }
>> >
>> > // Run training algorithm to build the model
>> > val numIterations = 20
>> > val model = SVMWithSGD.train(parsedData, numIterations)
>> >
>> > // Evaluate model on training examples and compute training error
>> > // val labelAndPreds = testParsedData.map { point =>
>> > //   val prediction = model.predict(point.features)
>> > //   (point.label, prediction)
>> > // }
>> > // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
>> > testParsedData.count
>> > // println("Training Error = " + trainErr)
>> > println(Calendar.getInstance().getTime())
>> > }
>> > }
>> >
>> >
>> >
>> >
>> > Thanks,
>> > Best,
>> > Peng
>> >
>> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng  wrote:
>> >>
>> >> DId you cache the data and check the load balancing? How many
>> >> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>> >>
>> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
>> >> > Watch the app manager it should tell you what's running and taking
>> >> > awhile...
>> >> > My guess it's a "distinct" function on the data.
>> >> > J
>> >> >
>> >> > Sent from my iPhone
>> >> >
>> >> > On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> >
>> >> >
>> >> > Previous we have applied SVM algorithm in MLlib to 5 million records
>> >> > (600
>> >> > mb), it takes more than 25 minutes to finish.
>> >> > The spark version we are using is 1.0 and we were running this program
>> >> > on a
>> >> > 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
>> >> >
>> >> > The 5 million records only have two distinct records (One positive and
>> >> > one
>> >> > negative), others are all duplications.
>> >> >
>> >> > Any one has any idea on why it takes so long on this small data?
>> >> >
>> >> >
>> >> >
>> >> > Thanks,
>> >> > Best,
>> >> >
>> >> > Peng
>> >
>> >
>

Re: SparkContext UI

Hi Stuart,

You're close!

Just add a () after the cache, like: data.cache()

...and then run the .count() action on it and you should be good to see it
in the Storage UI!


- Sameer

On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman 
wrote:

> Sorry too quick to pull the trigger on my original email.  I should have
> added that I'm tried using persist() and cache() but no joy.
>
> I'm doing this:
>
> data = sc.textFile("somedata")
>
> data.cache
>
> data.count()
>
> but I still can't see anything in the storage?
>
>
>
> On 31 October 2014 10:42, Sameer Farooqui  wrote:
>
>> Hey Stuart,
>>
>> The RDD won't show up under the Storage tab in the UI until it's been
>> cached. Basically Spark doesn't know what the RDD will look like until it's
>> cached, b/c up until then the RDD is just on disk (external to Spark). If
>> you launch some transformations + an action on an RDD that is purely on
>> disk, then Spark will read it from disk, compute against it and then write
>> the results back to disk or show you the results at the scala/python
>> shells. But when you run Spark workloads against purely on disk files, the
>> RDD won't show up in Spark's Storage UI. Hope that makes sense...
>>
>> - Sameer
>>
>> On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman > > wrote:
>>
>>> Hi All,
>>>
>>> When I load an RDD with:
>>>
>>> data = sc.textFile("somefile")
>>>
>>> I don't see the resulting RDD in the SparkContext gui on localhost:4040
>>> in /storage.
>>>
>>> Is there something special I need to do to allow me to view this?  I
>>> tried but scala and python shells but same result.
>>>
>>> Thanks
>>>
>>> Stuart
>>>
>>
>>
>

Scaladoc

2014-10-30 Thread Alessandro Baretta

How do I build the scaladoc html files from the spark source distribution?

Alex Bareta

Re: issue on applying SVM to 5 million examples.

Hi Xiangrui,

Can you give me some code example about caching, as I am new to Spark.

Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 6:57 PM, Xiangrui Meng  wrote:

> Then caching should solve the problem. Otherwise, it is just loading
> and parsing data from disk for each iteration. -Xiangrui
>
> On Thu, Oct 30, 2014 at 11:44 AM, peng xia  wrote:
> > Thanks for all your help.
> > I think I didn't cache the data. My previous cluster was expired and I
> don't
> > have a chance to check the load balance or app manager.
> > Below is my code.
> > There are 18 features for each record and I am using the Scala API.
> >
> > import org.apache.spark.SparkConf
> > import org.apache.spark.SparkContext
> > import org.apache.spark.SparkContext._
> > import org.apache.spark.rdd._
> > import org.apache.spark.mllib.classification.SVMWithSGD
> > import org.apache.spark.mllib.regression.LabeledPoint
> > import org.apache.spark.mllib.linalg.Vectors
> > import java.util.Calendar
> >
> > object BenchmarkClassification {
> > def main(args: Array[String]) {
> > // Load and parse the data file
> > val conf = new SparkConf()
> >  .setAppName("SVM")
> >  .set("spark.executor.memory", "8g")
> >  // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
> >val sc = new SparkContext(conf)
> > val data = sc.textFile(args(0))
> > val parsedData = data.map { line =>
> >  val parts = line.split(',')
> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
> > x.toDouble)))
> > }
> > val testData = sc.textFile(args(1))
> > val testParsedData = testData .map { line =>
> >  val parts = line.split(',')
> >  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
> > x.toDouble)))
> > }
> >
> > // Run training algorithm to build the model
> > val numIterations = 20
> > val model = SVMWithSGD.train(parsedData, numIterations)
> >
> > // Evaluate model on training examples and compute training error
> > // val labelAndPreds = testParsedData.map { point =>
> > //   val prediction = model.predict(point.features)
> > //   (point.label, prediction)
> > // }
> > // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble
> /
> > testParsedData.count
> > // println("Training Error = " + trainErr)
> > println(Calendar.getInstance().getTime())
> > }
> > }
> >
> >
> >
> >
> > Thanks,
> > Best,
> > Peng
> >
> > On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng  wrote:
> >>
> >> DId you cache the data and check the load balancing? How many
> >> features? Which API are you using, Scala, Java, or Python? -Xiangrui
> >>
> >> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
> >> > Watch the app manager it should tell you what's running and taking
> >> > awhile...
> >> > My guess it's a "distinct" function on the data.
> >> > J
> >> >
> >> > Sent from my iPhone
> >> >
> >> > On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> >
> >> > Previous we have applied SVM algorithm in MLlib to 5 million records
> >> > (600
> >> > mb), it takes more than 25 minutes to finish.
> >> > The spark version we are using is 1.0 and we were running this program
> >> > on a
> >> > 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
> >> >
> >> > The 5 million records only have two distinct records (One positive and
> >> > one
> >> > negative), others are all duplications.
> >> >
> >> > Any one has any idea on why it takes so long on this small data?
> >> >
> >> >
> >> >
> >> > Thanks,
> >> > Best,
> >> >
> >> > Peng
> >
> >
>

Re: SparkContext UI

2014-10-30 Thread Stuart Horsman

Sorry too quick to pull the trigger on my original email.  I should have
added that I'm tried using persist() and cache() but no joy.

I'm doing this:

data = sc.textFile("somedata")

data.cache

data.count()

but I still can't see anything in the storage?



On 31 October 2014 10:42, Sameer Farooqui  wrote:

> Hey Stuart,
>
> The RDD won't show up under the Storage tab in the UI until it's been
> cached. Basically Spark doesn't know what the RDD will look like until it's
> cached, b/c up until then the RDD is just on disk (external to Spark). If
> you launch some transformations + an action on an RDD that is purely on
> disk, then Spark will read it from disk, compute against it and then write
> the results back to disk or show you the results at the scala/python
> shells. But when you run Spark workloads against purely on disk files, the
> RDD won't show up in Spark's Storage UI. Hope that makes sense...
>
> - Sameer
>
> On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman 
> wrote:
>
>> Hi All,
>>
>> When I load an RDD with:
>>
>> data = sc.textFile("somefile")
>>
>> I don't see the resulting RDD in the SparkContext gui on localhost:4040
>> in /storage.
>>
>> Is there something special I need to do to allow me to view this?  I
>> tried but scala and python shells but same result.
>>
>> Thanks
>>
>> Stuart
>>
>
>

Re: SparkContext UI

Hey Stuart,

The RDD won't show up under the Storage tab in the UI until it's been
cached. Basically Spark doesn't know what the RDD will look like until it's
cached, b/c up until then the RDD is just on disk (external to Spark). If
you launch some transformations + an action on an RDD that is purely on
disk, then Spark will read it from disk, compute against it and then write
the results back to disk or show you the results at the scala/python
shells. But when you run Spark workloads against purely on disk files, the
RDD won't show up in Spark's Storage UI. Hope that makes sense...

- Sameer

On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman 
wrote:

> Hi All,
>
> When I load an RDD with:
>
> data = sc.textFile("somefile")
>
> I don't see the resulting RDD in the SparkContext gui on localhost:4040 in
> /storage.
>
> Is there something special I need to do to allow me to view this?  I tried
> but scala and python shells but same result.
>
> Thanks
>
> Stuart
>

SparkContext UI

2014-10-30 Thread Stuart Horsman

Hi All,

When I load an RDD with:

data = sc.textFile("somefile")

I don't see the resulting RDD in the SparkContext gui on localhost:4040 in
/storage.

Is there something special I need to do to allow me to view this?  I tried
but scala and python shells but same result.

Thanks

Stuart

Re: akka connection refused bug, fix?

followed this

http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Akka-Error-while-running-Spark-Jobs/td-p/18602

but the problem was not fixed..



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/akka-connection-refused-bug-fix-tp17764p17774.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

SparkSQL + Hive Cached Table Exception

2014-10-30 Thread Jean-Pascal Billaud

Hi,

While testing SparkSQL on top of our Hive metastore, I am getting
some java.lang.ArrayIndexOutOfBoundsException while reusing a cached RDD
table.

Basically, I have a table "mtable" partitioned by some "date" field in hive
and below is the scala code I am running in spark-shell:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
val rdd_mtable = sqlContext.sql("select * from mtable where date=20141028");
rdd_mtable.registerTempTable("rdd_mtable");
sqlContext.cacheTable("rdd_mtable");
sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- OK
sqlContext.sql("select count(*) from rdd_mtable").collect(); <-- Exception

So the first collect() is working just fine, however running the second
collect() which I expect use the cached RDD throws some
java.lang.ArrayIndexOutOfBoundsException, see the backtrace at the end of
this email. It seems the columnar traversal is crashing for some reasons.
FYI, I am using spark ToT (234de9232bcfa212317a8073c4a82c3863b36b14).

java.lang.ArrayIndexOutOfBoundsException: 14
at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(Row.scala:142)
at
org.apache.spark.sql.catalyst.expressions.BoundReference.eval(BoundAttribute.scala:37)
at
org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:89)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
at
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$computeSizeInBytes$1.apply(InMemoryColumnarTableScan.scala:66)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at
org.apache.spark.sql.columnar.InMemoryRelation.computeSizeInBytes(InMemoryColumnarTableScan.scala:66)
at
org.apache.spark.sql.columnar.InMemoryRelation.statistics(InMemoryColumnarTableScan.scala:87)
at
org.apache.spark.sql.columnar.InMemoryRelation.statisticsToBePropagated(InMemoryColumnarTableScan.scala:73)
at
org.apache.spark.sql.columnar.InMemoryRelation.withOutput(InMemoryColumnarTableScan.scala:147)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1$$anonfun$applyOrElse$1.apply(CacheManager.scala:122)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:122)
at
org.apache.spark.sql.CacheManager$$anonfun$useCachedData$1.applyOrElse(CacheManager.scala:119)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:191)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:147)
at
org.apache.spark.sql.CacheManager$class.useCachedData(CacheManager.scala:119)
at org.apache.spark.sql.SQLContext.useCachedData(SQLContext.scala:49)
at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:376)
at
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:376)
at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:377)
at
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:377)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:382)
at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:380)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:386)
at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:38

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng

Then caching should solve the problem. Otherwise, it is just loading
and parsing data from disk for each iteration. -Xiangrui

On Thu, Oct 30, 2014 at 11:44 AM, peng xia  wrote:
> Thanks for all your help.
> I think I didn't cache the data. My previous cluster was expired and I don't
> have a chance to check the load balance or app manager.
> Below is my code.
> There are 18 features for each record and I am using the Scala API.
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.rdd._
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.linalg.Vectors
> import java.util.Calendar
>
> object BenchmarkClassification {
> def main(args: Array[String]) {
> // Load and parse the data file
> val conf = new SparkConf()
>  .setAppName("SVM")
>  .set("spark.executor.memory", "8g")
>  // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
>val sc = new SparkContext(conf)
> val data = sc.textFile(args(0))
> val parsedData = data.map { line =>
>  val parts = line.split(',')
>  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
> x.toDouble)))
> }
> val testData = sc.textFile(args(1))
> val testParsedData = testData .map { line =>
>  val parts = line.split(',')
>  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
> x.toDouble)))
> }
>
> // Run training algorithm to build the model
> val numIterations = 20
> val model = SVMWithSGD.train(parsedData, numIterations)
>
> // Evaluate model on training examples and compute training error
> // val labelAndPreds = testParsedData.map { point =>
> //   val prediction = model.predict(point.features)
> //   (point.label, prediction)
> // }
> // val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
> testParsedData.count
> // println("Training Error = " + trainErr)
> println(Calendar.getInstance().getTime())
> }
> }
>
>
>
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng  wrote:
>>
>> DId you cache the data and check the load balancing? How many
>> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>>
>> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
>> > Watch the app manager it should tell you what's running and taking
>> > awhile...
>> > My guess it's a "distinct" function on the data.
>> > J
>> >
>> > Sent from my iPhone
>> >
>> > On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
>> >
>> > Hi,
>> >
>> >
>> >
>> > Previous we have applied SVM algorithm in MLlib to 5 million records
>> > (600
>> > mb), it takes more than 25 minutes to finish.
>> > The spark version we are using is 1.0 and we were running this program
>> > on a
>> > 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
>> >
>> > The 5 million records only have two distinct records (One positive and
>> > one
>> > negative), others are all duplications.
>> >
>> > Any one has any idea on why it takes so long on this small data?
>> >
>> >
>> >
>> > Thanks,
>> > Best,
>> >
>> > Peng
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: how idf is calculated

2014-10-30 Thread Ashic Mahtab

Hi Andrejs,The calculations are a bit different to what I've come across in 
Mining Massive Datasets (2nd Ed. Ullman et. al.,  Cambridge Press) available 
here:http://www.mmds.org/ 
Their calculation of IDF is as follows:
IDFi = log2(N / ni)
where N is the number of documents and ni is the number of documents in which 
the word appears. This looks different to your IDF function.
For TF, they use
TFij = fij / maxk fkj
That is:
For document j, the term frequency of the term i in j is the number of 
times i appears in j divided by the maximum number of times any term appears in 
j. Stop words are usually excluded when considering the maximum).
So, in your case, the 
TFa1 = 2 / 2 = 1
TFb1 = 1 / 2 = 0.5TFc1 = 1/2 = 0.5TFm1 = 2/2 = 1...
IDFa = log2(3 / 2) = 0.585
So, TFa1 * IDFa = 0.585
Wikipedia mentions an adjustment to overcome biases for long documents, by 
calculating TFij = 0.5 + {(0.5*fij)/maxk fkj}, but that doesn't change anything 
for TFa1, as the value remains 1.
In other words, my calculations don't agree with yours, and neither seem to 
agree with Spark :)
Regards,Ashic.
Date: Thu, 30 Oct 2014 22:13:49 +
Subject: how idf is calculated
From: andr...@sindicetech.com
To: u...@spark.incubator.apache.org

Hi,I'm writing a paper and I need to calculate tf-idf. Whit your help I managed 
to get results, I needed, but the problem is that I need to be able to explain 
how each number was gotten. So I tried to understand how idf was calculated and 
the numbers i get don't correspond to those I should get .  
I have 3 documents (each line a document)a a b c m me a c d e ed j k l m m c
When I calculate tf, I get this 
(1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])(1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])(1048576,[97,99,100,101],[1.0,1.0,1.0,3.0]
idf is supposedly calculated idf = log((m + 1) / (d(t) + 1))m -number of 
documents (3 in my case).d(t) - in how many documents is term presenta: 
log(4/3) =0.1249387366b: log(4/2) =0.3010299957c: log(4/4) =0d: log(4/3) 
=0.1249387366e: log(4/2) =0.3010299957l: log(4/2) =0.3010299957m: log(4/3) 
=0.1249387366
When I output  idf vector ` 
idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_)) `I get 
:1.38629436111989060.287682072451780850.6931471805599453
I understand why there are only 3 numbers, because only 3 are unique : 
log(4/2), log(4/3), log(4/4), but I don't understand how numbers in idf where 
calculated 
Best regards,Andrejs

how idf is calculated

2014-10-30 Thread Andrejs Abele

Hi,
I'm writing a paper and I need to calculate tf-idf. Whit your help I
managed to get results, I needed, but the problem is that I need to be able
to explain how each number was gotten. So I tried to understand how idf was
calculated and the numbers i get don't correspond to those I should get .

I have 3 documents (each line a document)
a a b c m m
e a c d e e
d j k l m m c

When I calculate tf, I get this
(1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
(1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
(1048576,[97,99,100,101],[1.0,1.0,1.0,3.0]

idf is supposedly calculated idf = log((m + 1) / (d(t) + 1))
m -number of documents (3 in my case).
d(t) - in how many documents is term present
a: log(4/3) =0.1249387366
b: log(4/2) =0.3010299957
c: log(4/4) =0
d: log(4/3) =0.1249387366
e: log(4/2) =0.3010299957
l: log(4/2) =0.3010299957
m: log(4/3) =0.1249387366

When I output  idf vector `
idf.idf.toArray.filter(_.>(0)).distinct.foreach(println(_)) `
I get :
1.3862943611198906
0.28768207245178085
0.6931471805599453

I understand why there are only 3 numbers, because only 3 are unique :
log(4/2), log(4/3), log(4/4), but I don't understand how numbers in idf
where calculated

Best regards,
Andrejs

Re: does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr

I think I understand how to deal with this, though I don't have all the code
working yet.  The point is that the V of (K, V) can itself be a tuple.  So
the updateFunc prototype looks something like

val updateDhcpState = (newValues: Seq[Tuple1[(Int, Time, Time)]], state:
Option[Tuple1[(Int, Time, Time)]]) =>
Option[Tuple1[(Int, Time, Time)]]
{  ...   }

And I'm wondering whether those Tuple1()s are superfluous.  Film at 11.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/does-updateStateByKey-accept-a-state-that-is-a-tuple-tp17756p17769.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Best way to partition RDD

Thanks Helena, very useful comment,
But is "‘spark.cassandra.input.split.size" only effective in Cluster not in
Single node?

best,
/Shahab

On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson  wrote:

> Shahab,
>
> Regardless, WRT cassandra and spark when using the spark cassandra
> connector,  ‘spark.cassandra.input.split.size’ passed into the SparkConf
> configures the approx number of Cassandra partitions in a Spark partition
> (default 10).
> No repartitioning should be necessary with what you have below, but I
> don’t know if you are running on one node or a cluster.
>
> This is a good initial guide:
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads
>
> https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37
>
> Cheers,
> Helena
> @helenaedelson
>
> On Oct 30, 2014, at 1:12 PM, Helena Edelson 
> wrote:
>
> Hi Shahab,
> -How many spark/cassandra nodes are in your cluster?
> -What is your deploy topology for spark and cassandra clusters? Are they
> co-located?
>
> - Helena
> @helenaedelson
>
> On Oct 30, 2014, at 12:16 PM, shahab  wrote:
>
> Hi.
>
> I am running an application in the Spark which first loads data from
> Cassandra and then performs some map/reduce jobs.
>
> val srdd = sqlContext.sql("select * from mydb.mytable "  )
> I noticed that the "srdd" only has one partition . no matter how big is
> the data loaded form Cassandra.
>
> So I perform "repartition" on the RDD , and then I did the map/reduce
> functions.
>
> But the main problem is that "repartition" takes so much time (almost 2
> min), which is not acceptable in my use-case. Is there any better way to do
> repartitioning?
>
> best,
> /Shahab
>
>
>
>

Re: Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread Michael Armbrust

That should be possible, although I'm not super familiar with thrift.
You'll probably need access to the generated metadata

.

 If you find yourself reading a lot of thrift data you
might consider writing a library that goes against the new SQL Data Source
API , which is about to be
merged in.  Its essentially applySchema on steroids.

This code for avro is possibly a useful reference:
https://github.com/marmbrus/sql-avro

On Thu, Oct 30, 2014 at 2:13 PM, ankits  wrote:

> I have one job with spark that creates some RDDs of type X and persists
> them
> in memory. The type X is an auto generated Thrift java class (not a case
> class though). Now in another job, I want to convert the RDD to a SchemaRDD
> using sqlContext.applySchema(). Can I derive a schema from the thrift
> definitions to convert RDD[X] to SchemaRDD[X]?
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-SchemaRDD-from-RDD-of-thrift-classes-tp17766.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Creating a SchemaRDD from RDD of thrift classes

2014-10-30 Thread ankits

I have one job with spark that creates some RDDs of type X and persists them
in memory. The type X is an auto generated Thrift java class (not a case
class though). Now in another job, I want to convert the RDD to a SchemaRDD
using sqlContext.applySchema(). Can I derive a schema from the thrift
definitions to convert RDD[X] to SchemaRDD[X]?









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-a-SchemaRDD-from-RDD-of-thrift-classes-tp17766.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Registering custom metrics

2014-10-30 Thread Gerard Maas

vHi,

I've been exploring the metrics exposed by Spark and I'm wondering whether
there's a way to register job-specific metrics that could be exposed
through the existing metrics system.

Would there be an  example somewhere?

BTW, documentation about how the metrics work could be improved. I found
out about the default servlet and the metrics/json/ endpoint on the code. I
could not find any reference to that on the dedicated doc page [1].
Probably something I could contribute if there's nobody on that at the
moment.

-kr, Gerard.

[1]   http://spark.apache.org/docs/1.1.0/monitoring.html#Metrics

akka connection refused bug, fix?

Hi, I saw the same issue as this thread,

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-td9864.html

Anyone has a fix for this bug? Please?!

The log info in my worker node is like,

14/10/30 20:15:18 INFO Worker: Asked to kill executor
app-20141030201514-/0
14/10/30 20:15:18 INFO ExecutorRunner: Runner thread for executor
app-20141030201514-/0 interrupted
14/10/30 20:15:18 INFO ExecutorRunner: Killing process!
14/10/30 20:15:18 INFO Worker: Executor app-20141030201514-/0 finished
with state KILLED exitStatus 1
14/10/30 20:15:18 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A47087-2#-814958390]
was not delivered. [1] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] ->
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811
]
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] ->
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811
]
14/10/30 20:15:18 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:42816] ->
[akka.tcp://sparkExecutor@node001:35811]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:35811]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:35811]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:35811




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/akka-connection-refused-bug-fix-tp17764.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: k-mean - result interpretation

Yes that is exactly it. The values are not comparable since normalization
is also shrinking all distances. Squared error is not an absolute metric.

I haven't thought about this much but maybe you are looking for something
like the silhouette coefficient?
On Oct 30, 2014 5:35 PM, "mgCl2"  wrote:

> Hello everyone,
>
> I'm trying to use MLlib's K-mean algorithm.
>
> I tried it on raw data, Here is a example of a line contained in my input
> data set:
> 82.9817 3281.4495
>
> with those parameters:
> *numClusters*=4
> *numIterations*=20
>
> results:
> *WSSSE = 6.375371241589461E9*
>
> Then I normalized my data:
> 0.02219046937793337492 0.97780953062206662508
> With the same parameters, result is now:
>  *WSSSE= 0.04229916511906393*
>
> Is it normal that normalization improve my results?
> Why isn't the WSSSE normalized? Because it seems that having smaller values
> end to a smaller WSSSE
> I'm sure I missed something here!
>
> Florent
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/k-mean-result-interpretation-tp17748.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Do Spark executors restrict native heap vs JVM heap?

No, but, the JVM also does not allocate memory for native code on the heap.
I dont think heap has any bearing on whether your native code can't
allocate more memory except that of course the heap is also taking memory.
On Oct 30, 2014 6:43 PM, "Paul Wais"  wrote:

> Dear Spark List,
>
> I have a Spark app that runs native code inside map functions.  I've
> noticed that the native code sometimes sets errno to ENOMEM indicating
> a lack of available memory.  However, I've verified that the /JVM/ has
> plenty of heap space available-- Runtime.getRuntime().freeMemory()
> shows gigabytes free and the native code needs only megabytes.  Does
> spark limit the /native/ heap size somehow?  Am poking through the
> executor code now but don't see anything obvious.
>
> Best Regards,
> -Paul Wais
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Out of memory with Spark Streaming

2014-10-30 Thread Chris Fregly

curious about why you're only seeing 50 records max per batch.

how many receivers are you running?  what is the rate that you're putting
data onto the stream?

per the default AWS kinesis configuration, the producer can do 1000 PUTs
per second with max 50k bytes per PUT and max 1mb per second per shard.

on the consumer side, you can only do 5 GETs per second and 2mb per second
per shard.

my hunch is that the 5 GETs per second is what's limiting your consumption
rate.

can you verify that these numbers match what you're seeing?  if so, you may
want to increase your shards and therefore the number of kinesis receivers.

otherwise, this may require some further investigation on my part.  i wanna
stay on top of this if it's an issue.

thanks for posting this, aniket!

-chris

On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Hi all
>
> Sorry but this was totally my mistake. In my persistence logic, I was
> creating async http client instance in RDD foreach but was never closing it
> leading to memory leaks.
>
> Apologies for wasting everyone's time.
>
> Thanks,
> Aniket
>
> On 12 September 2014 02:20, Tathagata Das 
> wrote:
>
>> Which version of spark are you running?
>>
>> If you are running the latest one, then could try running not a window
>> but a simple event count on every 2 second batch, and see if you are still
>> running out of memory?
>>
>> TD
>>
>>
>> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
>> aniket.bhatna...@gmail.com> wrote:
>>
>>> I did change it to be 1 gb. It still ran out of memory but a little
>>> later.
>>>
>>> The streaming job isnt handling a lot of data. In every 2 seconds, it
>>> doesn't get more than 50 records. Each record size is not more than 500
>>> bytes.
>>>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" 
>>> wrote:
>>>
 You could set "spark.executor.memory" to something bigger than the
 default (512mb)

 On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
 aniket.bhatna...@gmail.com> wrote:

> I am running a simple Spark Streaming program that pulls in data from
> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
> data and persists to a store.
>
> The program is running in local mode right now and runs out of memory
> after a while. I am yet to investigate heap dumps but I think Spark isn't
> releasing memory after processing is complete. I have even tried changing
> storage level to disk only.
>
> Help!
>
> Thanks,
> Aniket
>

>>
>

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell

Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env.

On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev  wrote:

> Thanks.. I was using Scala 2.11.1 and was able to
> use algebird-core_2.10-0.1.11.jar with spark-shell.
>
> On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell 
> wrote:
>
>> Whats the error with the 2.10 version of algebird?
>>
>> On Thu, Oct 30, 2014 at 12:49 AM, thadude  wrote:
>>
>>> I've tried:
>>>
>>> . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
>>>
>>> scala> import com.twitter.algebird._
>>> import com.twitter.algebird._
>>>
>>> scala>  import HyperLogLog._
>>> import HyperLogLog._
>>>
>>> scala> import com.twitter.algebird.HyperLogLogMonoid
>>> import com.twitter.algebird.HyperLogLogMonoid
>>>
>>>
>>> scala> val hll = new HyperLogLogMonoid(12)
>>> hll: com.twitter.algebird.HyperLogLogMonoid =
>>> com.twitter.algebird.HyperLogLogMonoid@7bde289a
>>>
>>>
>>> https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Algebird-using-spark-shell-tp17701p17714.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

Re: issue on applying SVM to 5 million examples.

Thanks for all your help.
I think I didn't cache the data. My previous cluster was expired and I
don't have a chance to check the load balance or app manager.
Below is my code.
There are 18 features for each record and I am using the Scala API.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import java.util.Calendar

object BenchmarkClassification {
def main(args: Array[String]) {
// Load and parse the data file
val conf = new SparkConf()
  .setAppName("SVM")
  .set("spark.executor.memory", "8g")
  // .set("spark.executor.extraJavaOptions", "-Xms8g -Xmx8g")
val sc = new SparkContext(conf)
val data = sc.textFile(args(0))
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
x.toDouble)))
}
val testData = sc.textFile(args(1))
val testParsedData = testData .map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(x =>
x.toDouble)))
}

// Run training algorithm to build the model
val numIterations = 20
val model = SVMWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error
// val labelAndPreds = testParsedData.map { point =>
//   val prediction = model.predict(point.features)
//   (point.label, prediction)
// }
// val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble /
testParsedData.count
// println("Training Error = " + trainErr)
println(Calendar.getInstance().getTime())
}
}




Thanks,
Best,
Peng

On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui Meng  wrote:

> DId you cache the data and check the load balancing? How many
> features? Which API are you using, Scala, Java, or Python? -Xiangrui
>
> On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
> > Watch the app manager it should tell you what's running and taking
> awhile...
> > My guess it's a "distinct" function on the data.
> > J
> >
> > Sent from my iPhone
> >
> > On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
> >
> > Hi,
> >
> >
> >
> > Previous we have applied SVM algorithm in MLlib to 5 million records (600
> > mb), it takes more than 25 minutes to finish.
> > The spark version we are using is 1.0 and we were running this program
> on a
> > 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
> >
> > The 5 million records only have two distinct records (One positive and
> one
> > negative), others are all duplications.
> >
> > Any one has any idea on why it takes so long on this small data?
> >
> >
> >
> > Thanks,
> > Best,
> >
> > Peng
>

Do Spark executors restrict native heap vs JVM heap?

2014-10-30 Thread Paul Wais

Dear Spark List,

I have a Spark app that runs native code inside map functions.  I've
noticed that the native code sometimes sets errno to ENOMEM indicating
a lack of available memory.  However, I've verified that the /JVM/ has
plenty of heap space available-- Runtime.getRuntime().freeMemory()
shows gigabytes free and the native code needs only megabytes.  Does
spark limit the /native/ heap size somehow?  Am poking through the
executor code now but don't see anything obvious.

Best Regards,
-Paul Wais

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

does updateStateByKey accept a state that is a tuple?

2014-10-30 Thread spr

I'm trying to implement a Spark Streaming program to calculate the number of
instances of a given key encountered and the minimum and maximum times at
which it was encountered.  updateStateByKey seems to be just the thing, but
when I define the "state" to be a tuple, I get compile errors I'm not
finding a way around.  Perhaps it's something simple, but I'm stumped.

var lines =  ssc.textFileStream(dirArg)
var linesArray = lines.map( line => (line.split("\t")))
var newState = linesArray.map( lineArray => (lineArray(4), 1,
   Time((lineArray(0).toDouble*1000).toInt),
   Time((lineArray(0).toDouble*1000).toInt)))

val updateDhcpState = (newValues: Seq[(Int, Time, Time)], state:
Option[(Int, Time, Time)]) => 
{
  val newCount = newValues.map( x => x._1).sum
  val newMinTime = newValues.map( x => x._2).min
  val newMaxTime = newValues.map( x => x._3).max
  val (count, minTime, maxTime) = state.getOrElse((0,
Time(Int.MaxValue), Time(Int.MinValue)))

  Some((count+newCount, Seq(minTime, newMinTime).min, Seq(maxTime,
newMaxTime).max))
  //(count+newCount, Seq(minTime, newMinTime).min, Seq(maxTime,
newMaxTime).max)
}

var DhcpSvrCum = newState.updateStateByKey[(Int, Time,
Time)](updateDhcpState) 

The error I get is

[info] Compiling 3 Scala sources to
/Users/spr/Documents/.../target/scala-2.10/classes...
[error] /Users/spr/Documents/...StatefulDhcpServersHisto.scala:95: value
updateStateByKey is not a member of
org.apache.spark.streaming.dstream.DStream[(String, Int,
org.apache.spark.streaming.Time, org.apache.spark.streaming.Time)]
[error] var DhcpSvrCum = newState.updateStateByKey[(Int, Time,
Time)](updateDhcpState) 

I don't understand why the String is being prepended to the tuple I expect
(Int, Time, Time).  In the main example (StatefulNetworkWordCount,  here

 
), the data is a stream of (String, Int) tuples created by

val wordDstream = words.map(x => (x, 1))

and the updateFunc ignores the String key in its definition

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  val currentCount = values.sum
  val previousCount = state.getOrElse(0)
  Some(currentCount + previousCount)
}

Is there some special-casing of a key with a simple (non-tuple) value?  How
could this work with a tuple value?  

Thanks in advance.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/does-updateStateByKey-accept-a-state-that-is-a-tuple-tp17756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: stage failure: java.lang.IllegalStateException: unread block data

The worker side has error message as this,

14/10/30 18:29:00 INFO Worker: Asked to launch executor
app-20141030182900-0006/0 for testspark_v1
14/10/30 18:29:01 INFO ExecutorRunner: Launch command: "java" "-cp"
"::/root/spark-1.1.0/conf:/root/spark-1.1.0/assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.3.0.jar"
"-XX:MaxPermSize=128m" "-Dspark.driver.port=52552" "-Xms512M" "-Xmx512M"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"akka.tcp://sparkDriver@master:52552/user/CoarseGrainedScheduler" "0"
"node001" "4" "akka.tcp://sparkWorker@node001:60184/user/Worker"
"app-20141030182900-0006"
14/10/30 18:29:03 INFO Worker: Asked to kill executor
app-20141030182900-0006/0
14/10/30 18:29:03 INFO ExecutorRunner: Runner thread for executor
app-20141030182900-0006/0 interrupted
14/10/30 18:29:03 INFO ExecutorRunner: Killing process!
14/10/30 18:29:03 ERROR FileAppender: Error writing stream to file
/root/spark-1.1.0/work/app-20141030182900-0006/0/stderr
java.io.IOException: Stream Closed
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:214)
at
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
14/10/30 18:29:04 INFO Worker: Executor app-20141030182900-0006/0 finished
with state KILLED exitStatus 143
14/10/30 18:29:04 INFO LocalActorRef: Message
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%4010.180.49.228%3A52120-22#1336571562]
was not delivered. [6] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] ->
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] ->
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]
14/10/30 18:29:04 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkWorker@node001:60184] ->
[akka.tcp://sparkExecutor@node001:37697]: Error [Association failed with
[akka.tcp://sparkExecutor@node001:37697]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkExecutor@node001:37697]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: node001/10.180.49.228:37697
]

Thanks!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/stage-failure-java-lang-IllegalStateException-unread-block-data-tp17751p17755.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Ambiguous references to id : what does it mean ?

2014-10-30 Thread Terry Siu

Found this as I am having the same issue. I have exactly the same usage as
shown in Michael's join example. I tried executing a SQL statement against
the join data set with two columns that have the same name and tried to
"unambiguate" the column name with the table alias, but I would still get an
"Unresolved attributes" error back. Is there any way around this short of
renaming the columns in the join sources?

Thanks
-Terry

Michael Armbrust wrote
Yes, but if both tagCollection and selectedVideos have a column named "id"
then Spark SQL does not know which one you are referring to in the where
clause.  Here's an example with aliases:
 val x = testData2.as('x)
 val y = testData2.as('y)
 val join = x.join(y, Inner, Some("x.a".attr === "y.a".attr))
On Wed, Jul 16, 2014 at 2:47 AM, Jaonary Rabarisoa <

jaonary@

>
wrote:
My query is just a simple query that use the spark sql dsl :

tagCollection.join(selectedVideos).where('videoId === 'id)

On Tue, Jul 15, 2014 at 6:03 PM, Yin Huai <

huaiyin.thu@

> wrote:

Hi Jao,

Seems the SQL analyzer cannot resolve the references in the Join
condition. What is your query? Did you use the Hive Parser (your query
was
submitted through hql(...)) or the basic SQL Parser (your query was
submitted through sql(...)).

Thanks,

Yin

On Tue, Jul 15, 2014 at 8:52 AM, Jaonary Rabarisoa <

jaonary@

>
wrote:

Hi all,

When running a join operation with Spark SQL I got the following error
:

Exception in thread "main"
org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Ambiguous
references to id: (id#303,List()),(id#0,List()), tree:
Filter ('videoId = 'id)
  Join Inner, None
   ParquetRelation data/tags.parquet
   Filter (name#1 = P1/cam1)
ParquetRelation data/videos.parquet

What does it mean ?

Cheers,

jao

Re: Doing RDD."count" in parallel , at at least parallelize it as much as possible?

By the way, in case you haven't done so, do try to .cache() the RDD before
running a .count() on it as that could make a big speed improvement.



On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui 
wrote:

> Hi Shahab,
>
> Are you running Spark in Local, Standalone, YARN or Mesos mode?
>
> If you're running in Standalone/YARN/Mesos, then the .count() action is
> indeed automatically parallelized across multiple Executors.
>
> When you run a .count() on an RDD, it is actually distributing tasks to
> different executors to each do a local count on a local partition and then
> all the tasks send their sub-counts back to the driver for final
> aggregation. This sounds like the kind of behavior you're looking for.
>
> However, in Local mode, everything runs in a single JVM (the driver +
> executor), so there's no parallelization across Executors.
>
>
>
> On Thu, Oct 30, 2014 at 10:25 AM, shahab  wrote:
>
>> Hi,
>>
>> I noticed that the "count" (of RDD)  in many of my queries is the most
>> time consuming one as it runs in the "driver" process rather then done by
>> parallel worker nodes,
>>
>> Is there any way to perform "count" in parallel , at at least parallelize
>>  it as much as possible?
>>
>> best,
>> /Shahab
>>
>
>

Re: Doing RDD."count" in parallel , at at least parallelize it as much as possible?

Hi Shahab,

Are you running Spark in Local, Standalone, YARN or Mesos mode?

If you're running in Standalone/YARN/Mesos, then the .count() action is
indeed automatically parallelized across multiple Executors.

When you run a .count() on an RDD, it is actually distributing tasks to
different executors to each do a local count on a local partition and then
all the tasks send their sub-counts back to the driver for final
aggregation. This sounds like the kind of behavior you're looking for.

However, in Local mode, everything runs in a single JVM (the driver +
executor), so there's no parallelization across Executors.

On Thu, Oct 30, 2014 at 10:25 AM, shahab  wrote:

> Hi,
>
> I noticed that the "count" (of RDD)  in many of my queries is the most
> time consuming one as it runs in the "driver" process rather then done by
> parallel worker nodes,
>
> Is there any way to perform "count" in parallel , at at least parallelize
>  it as much as possible?
>
> best,
> /Shahab
>

stage failure: java.lang.IllegalStateException: unread block data

Hi, 

Got this error when running spark 1.1.0 to read Hbase 0.98.1 through simple
python code in a ec2 cluster. The same program runs correctly in local mode.
So this error only happens when running in a real cluster.

Here's what I got,

14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 1]
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID
2, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 2]
14/10/30 17:51:53 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID
3, node001, ANY, 1265 bytes)
14/10/30 17:51:53 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on
executor node001: java.lang.IllegalStateException (unread block data)
[duplicate 3]
14/10/30 17:51:53 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
14/10/30 17:51:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool 
14/10/30 17:51:53 INFO TaskSchedulerImpl: Cancelling stage 0
14/10/30 17:51:53 INFO DAGScheduler: Failed to run first at
SerDeUtil.scala:70
Traceback (most recent call last):
  File "/root/workspace/test/sparkhbase.py", line 22, in 
conf=conf2)
  File "/root/spark-1.1.0/python/pyspark/context.py", line 471, in
newAPIHadoopRDD
jconf, batchSize)
  File
"/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 3, node001): java.lang.IllegalStateException: unread block data
   
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2399)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
   
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1969)
   
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1776)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
   
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:679)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWor

Re: Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch

The byte array turns out to be a serialized ObjectOutputStream that
contains  a Tuple2[ParallelCollectionRDD,Function2].

What then should be done differently in the broadcast code (which follows
the structure of an example taken from mllib)?

assert(crows.isInstanceOf[Array[MVector]])
val bcRows = sc.broadcast(crows)
..

  val arrayVect = bcRows.value



2014-10-30 7:42 GMT-07:00 Stephen Boesch :

>
> As a template for creating a broadcast variable, the following code
> snippet within mllib was used:
>
> val bcIdf = dataset.context.broadcast(idf)
> dataset.mapPartitions { iter =>
>   val thisIdf = bcIdf.value
>
>
> The new code follows that model:
>
> import org.apache.spark.mllib.linalg.{Vector => MVector}
>   ..
> assert(crows.isInstanceOf[Array[MVector]])
> val bcRows = sc.broadcast(crows)
> val GU = mat.rows.zipWithIndex.mapPartitions { case dataIter =>
> val arrayVect = bcRows.value // bcRows.value is seen in
> debugger to be of type Array[Byte] .. ??
>
> That last line is unhappy:
>
>java.lang.ClassCastException: [B cannot be cast to
> [Lorg.apache.spark.mllib.linalg.Vector;
>
> So the compiler is aware that the return type of the broadcast "value"
> method should be an array of vector (which it should). However the actual
> type is Array[Byte].   Any insights on this?
>
>

k-mean - result interpretation

2014-10-30 Thread mgCl2

Hello everyone,

I'm trying to use MLlib's K-mean algorithm.

I tried it on raw data, Here is a example of a line contained in my input
data set:
82.9817 3281.4495

with those parameters:
*numClusters*=4
*numIterations*=20

results:
*WSSSE = 6.375371241589461E9*

Then I normalized my data:
0.02219046937793337492 0.97780953062206662508
With the same parameters, result is now:
 *WSSSE= 0.04229916511906393*

Is it normal that normalization improve my results?
Why isn't the WSSSE normalized? Because it seems that having smaller values
end to a smaller WSSSE
I'm sure I missed something here!

Florent







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/k-mean-result-interpretation-tp17748.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: MLLib: libsvm - default value initialization

2014-10-30 Thread Xiangrui Meng

You can remove 0.5 from all non-zeros. -Xiangrui

On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak  wrote:
> Hi All,
> I have my sparse data in libsvm format.
>
> val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
> "mllib/data/sample_libsvm_data.txt")
>
> I am running Linear regression. Let us say that my data has following entry:
> 1 1:0  4:1
>
> I think it will assume 0 for indices 2 and 3, right? I would like to make
> default values to be 0.5  instead of 0. Is it possible? If not, I will have
> to switch to dense data and it will significantly increase the data size for
> me.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Best way to partition RDD

2014-10-30 Thread Helena Edelson

Shahab,

Regardless, WRT cassandra and spark when using the spark cassandra connector,  
‘spark.cassandra.input.split.size’ passed into the SparkConf configures the 
approx number of Cassandra partitions in a Spark partition (default 10).
No repartitioning should be necessary with what you have below, but I don’t 
know if you are running on one node or a cluster.

This is a good initial guide: 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#configuration-options-for-adjusting-reads
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L26-L37

Cheers,
Helena
@helenaedelson

On Oct 30, 2014, at 1:12 PM, Helena Edelson  wrote:

> Hi Shahab, 
> -How many spark/cassandra nodes are in your cluster?
> -What is your deploy topology for spark and cassandra clusters? Are they 
> co-located?
> 
> - Helena
> @helenaedelson
> 
> On Oct 30, 2014, at 12:16 PM, shahab  wrote:
> 
>> Hi.
>> 
>> I am running an application in the Spark which first loads data from 
>> Cassandra and then performs some map/reduce jobs.
>> val srdd = sqlContext.sql("select * from mydb.mytable "  )
>> 
>> I noticed that the "srdd" only has one partition . no matter how big is the 
>> data loaded form Cassandra.
>> 
>> So I perform "repartition" on the RDD , and then I did the map/reduce 
>> functions.
>> 
>> But the main problem is that "repartition" takes so much time (almost 2 
>> min), which is not acceptable in my use-case. Is there any better way to do 
>> repartitioning?
>> 
>> best,
>> /Shahab
>

Doing RDD."count" in parallel , at at least parallelize it as much as possible?

Hi,

I noticed that the "count" (of RDD)  in many of my queries is the most time
consuming one as it runs in the "driver" process rather then done by
parallel worker nodes,

Is there any way to perform "count" in parallel , at at least parallelize
 it as much as possible?

best,
/Shahab

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Xiangrui Meng

DId you cache the data and check the load balancing? How many
features? Which API are you using, Scala, Java, or Python? -Xiangrui

On Thu, Oct 30, 2014 at 9:13 AM, Jimmy  wrote:
> Watch the app manager it should tell you what's running and taking awhile...
> My guess it's a "distinct" function on the data.
> J
>
> Sent from my iPhone
>
> On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
>
> Hi,
>
>
>
> Previous we have applied SVM algorithm in MLlib to 5 million records (600
> mb), it takes more than 25 minutes to finish.
> The spark version we are using is 1.0 and we were running this program on a
> 4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
>
> The 5 million records only have two distinct records (One positive and one
> negative), others are all duplications.
>
> Any one has any idea on why it takes so long on this small data?
>
>
>
> Thanks,
> Best,
>
> Peng

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Best way to partition RDD

Hi Helena,

Well... I am just running a toy example, I have one Cassandra node
co-located with the Spark Master and one of Spark Workers, all in one
machine. I have another node which runs the second Spark worker.

/Shahab,


On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson  wrote:

> Hi Shahab,
> -How many spark/cassandra nodes are in your cluster?
> -What is your deploy topology for spark and cassandra clusters? Are they
> co-located?
>
> - Helena
> @helenaedelson
>
> On Oct 30, 2014, at 12:16 PM, shahab  wrote:
>
> Hi.
>
> I am running an application in the Spark which first loads data from
> Cassandra and then performs some map/reduce jobs.
>
> val srdd = sqlContext.sql("select * from mydb.mytable "  )
> I noticed that the "srdd" only has one partition . no matter how big is
> the data loaded form Cassandra.
>
> So I perform "repartition" on the RDD , and then I did the map/reduce
> functions.
>
> But the main problem is that "repartition" takes so much time (almost 2
> min), which is not acceptable in my use-case. Is there any better way to do
> repartitioning?
>
> best,
> /Shahab
>
>
>

Re: Best way to partition RDD

2014-10-30 Thread Helena Edelson

Hi Shahab, 
-How many spark/cassandra nodes are in your cluster?
-What is your deploy topology for spark and cassandra clusters? Are they 
co-located?

- Helena
@helenaedelson

On Oct 30, 2014, at 12:16 PM, shahab  wrote:

> Hi.
> 
> I am running an application in the Spark which first loads data from 
> Cassandra and then performs some map/reduce jobs.
> val srdd = sqlContext.sql("select * from mydb.mytable "  )
> 
> I noticed that the "srdd" only has one partition . no matter how big is the 
> data loaded form Cassandra.
> 
> So I perform "repartition" on the RDD , and then I did the map/reduce 
> functions.
> 
> But the main problem is that "repartition" takes so much time (almost 2 min), 
> which is not acceptable in my use-case. Is there any better way to do 
> repartitioning?
> 
> best,
> /Shahab

Re: Algebird using spark-shell

2014-10-30 Thread Buntu Dev

Thanks.. I was using Scala 2.11.1 and was able to
use algebird-core_2.10-0.1.11.jar with spark-shell.

On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell  wrote:

> Whats the error with the 2.10 version of algebird?
>
> On Thu, Oct 30, 2014 at 12:49 AM, thadude  wrote:
>
>> I've tried:
>>
>> . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
>>
>> scala> import com.twitter.algebird._
>> import com.twitter.algebird._
>>
>> scala>  import HyperLogLog._
>> import HyperLogLog._
>>
>> scala> import com.twitter.algebird.HyperLogLogMonoid
>> import com.twitter.algebird.HyperLogLogMonoid
>>
>>
>> scala> val hll = new HyperLogLogMonoid(12)
>> hll: com.twitter.algebird.HyperLogLogMonoid =
>> com.twitter.algebird.HyperLogLogMonoid@7bde289a
>>
>>
>> https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Algebird-using-spark-shell-tp17701p17714.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen

Hi all,

I'd like to be able to modify values in a DStream, and then send it off to
an external source like Cassandra, but I keep getting Serialization errors
and am not sure how to use the correct design pattern. I was wondering if
you could help me.

I'd like to be able to do the following:

 wordCounts.foreachRDD( rdd => {

   val arr = record.toArray
   ...

})

I would like to use the "arr" to send back to cassandra, for instance:

Use it like this:

val collection = sc.parallelize(Seq(a.head._1, a.head_.2))
collection.saveToCassandra()

Or something like that, but as you know, I can't do this within the
"foreacRDD" but only at the driver level. How do I use the "arr" variable
to do something like that ?

Thanks for any help,

Harold

Re: Manipulating RDDs within a DStream

2014-10-30 Thread Harold Nguyen

Hi,

Sorry, there's a typo there:

val arr = rdd.toArray


Harold

On Thu, Oct 30, 2014 at 9:58 AM, Harold Nguyen  wrote:

> Hi all,
>
> I'd like to be able to modify values in a DStream, and then send it off to
> an external source like Cassandra, but I keep getting Serialization errors
> and am not sure how to use the correct design pattern. I was wondering if
> you could help me.
>
> I'd like to be able to do the following:
>
>  wordCounts.foreachRDD( rdd => {
>
>val arr = record.toArray
>...
>
> })
>
> I would like to use the "arr" to send back to cassandra, for instance:
>
> Use it like this:
>
> val collection = sc.parallelize(Seq(a.head._1, a.head_.2))
> collection.saveToCassandra()
>
> Or something like that, but as you know, I can't do this within the
> "foreacRDD" but only at the driver level. How do I use the "arr" variable
> to do something like that ?
>
> Thanks for any help,
>
> Harold
>
>

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Vladimir Rodionov

GC limit overhead exceeded is usually sign of either inadequate heap size
(not the case here) or application produces garbage (temp objects) faster
than garbage collector collects them - GC consumes most CPU cycles. 17G of
Java heap is quite large for many application and is above "safe and
recommended" limit (6-8GB) for Java server application.

>From what I saw in the stack trace I can conclude that some operations in
RDD implementation are heap polluters. I am not the expert in Spark but it
seems that Spark is not well optimized yet to work with reasonably large
Java heaps.

One of the options here is try to reduce JVM heap size and reduce data size
per JVM instance.

-Vladimir Rodionov



On Thu, Oct 30, 2014 at 5:22 AM, Ilya Ganelin  wrote:

> The split is something like 30 million into 2 milion partitions. The
> reason that it becomes tractable is that after I perform the Cartesian on
> the split data and operate on it I don't keep the full results - I actually
> only keep a tiny fraction of that generated dataset - making the overall
> dataset tractable ( I neglected to mention this in the first email).
>
> The way the code is structured I have forced linear execution until this
> point so at the time of execution of the split it is the only thing
> happening. In terms of memory I have assigned 23gb of memory and 17gb of
> heap.
> On Oct 30, 2014 3:32 AM, "Sean Owen"  wrote:
>
>> Can you be more specific about numbers?
>> I am not sure that splitting helps so much in the end, in that it has
>> the same effect as executing a smaller number at a time of the large
>> number of tasks that the full cartesian join would generate.
>> The full join is probably intractable no matter what in this case?
>> The OOM is not necessarily directly related. It depends on where it
>> happened, what else you are doing, how much memory you gave, etc.
>>
>> On Thu, Oct 30, 2014 at 3:29 AM, Ganelin, Ilya
>>  wrote:
>> > Hey all – not writing to necessarily get a fix but more to get an
>> > understanding of what’s going on internally here.
>> >
>> > I wish to take a cross-product of two very large RDDs (using
>> cartesian), the
>> > product of which is well in excess of what can be stored on disk .
>> Clearly
>> > that is intractable, thus my solution is to do things in batches -
>> > essentially I can take the cross product of a small piece of the first
>> data
>> > set with the entirety of the other. To do this, I calculate how many
>> items
>> > can fit into 1 gig of memory. Next, I use RDD.random Split() to
>> partition
>> > the first data set. The issue is that I am trying to partition an RDD of
>> > several million items into several million partitions. This throws the
>> > following error:
>> >
>> > I would like to understand the internals of what’s going on here so
>> that I
>> > can adjust my approach accordingly. Thanks in advance.
>> >
>> >
>> > 14/10/29 22:17:44 ERROR ActorSystemImpl: Uncaught fatal error from
>> thread
>> > [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem
>> > [sparkDriver]
>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>> > at com.google.protobuf_spark.ByteString.toByteArray(ByteString.java:213)
>> > at
>> akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
>> > at
>> >
>> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
>> > at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
>> > at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
>> > at
>> >
>> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
>> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> > at
>> >
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> > at
>> >
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> > at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> > at
>> >
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
>> > exceeded
>> > at java.util.Arrays.copyOfRange(Arrays.java:2694)
>> > at java.lang.String.(String.java:203)
>> > at java.lang.String.substring(String.java:1913)
>> > at java.lang.String.subSequence(String.java:1946)
>> > at java.util.regex.Matcher.getSubSequence(Matcher.java:1245)
>> > at java.util.regex.Matcher.group(Matcher.java:490)
>> > at java.util.Formatter$FormatSpecifier.(Formatter.java:2675)
>> > at java.util.Formatter.parse(Formatter.java:2528)
>> > at java.util.Formatter.format(Formatter.java:2469)
>> > at java.util.Formatter.format(Formatter.java:2423)
>> > at java.lang.St

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic

I'm  connecting to it remotly with tableau/beeline.

On Thu Oct 30 16:51:13 2014 GMT+0100, Denny Lee [via Apache Spark User List] 
wrote:
> 
> 
> When you are starting the thrift server service - are you connecting to it
> locally or is this on a remote server when you use beeline and/or Tableau?
> 
> On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic  wrote:
> 
> > I use beta driver SQL ODBC from Databricks.
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >
> 
> 
> 
> 
> ___
> If you reply to this email, your message will be added to the discussion 
> below:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17734.html
> To start a new topic under Apache Spark User List, email 
> ml-node+s1001560n1...@n3.nabble.com
> To unsubscribe from Spark + Tableau, visit 
> http://apache-spark-user-list.1001560.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=17720&code=Ymxvb2Q5cmF2ZW5AZ21haWwuY29tfDE3NzIwfDU5NzgxNDc0Ng=

-- 
Sent from my Jolla



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17737.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Best way to partition RDD

Hi.

I am running an application in the Spark which first loads data from
Cassandra and then performs some map/reduce jobs.

val srdd = sqlContext.sql("select * from mydb.mytable "  )
I noticed that the "srdd" only has one partition . no matter how big is the
data loaded form Cassandra.

So I perform "repartition" on the RDD , and then I did the map/reduce
functions.

But the main problem is that "repartition" takes so much time (almost 2
min), which is not acceptable in my use-case. Is there any better way to do
repartitioning?

best,
/Shahab

Re: issue on applying SVM to 5 million examples.

2014-10-30 Thread Jimmy

Watch the app manager it should tell you what's running and taking awhile... My 
guess it's a "distinct" function on the data.
J

Sent from my iPhone

> On Oct 30, 2014, at 8:22 AM, peng xia  wrote:
> 
> Hi,
> 
>  
> 
> Previous we have applied SVM algorithm in MLlib to 5 million records (600 
> mb), it takes more than 25 minutes to finish.
> The spark version we are using is 1.0 and we were running this program on a 4 
> nodes cluster. Each node has 4 cpu cores and 11 GB RAM.
> 
> The 5 million records only have two distinct records (One positive and one 
> negative), others are all duplications.
> 
> Any one has any idea on why it takes so long on this small data?
> 
>  
> 
> Thanks,
> Best,
> 
> Peng

Re: Spark + Tableau

2014-10-30 Thread Denny Lee

When you are starting the thrift server service - are you connecting to it
locally or is this on a remote server when you use beeline and/or Tableau?

On Thu, Oct 30, 2014 at 8:00 AM, Bojan Kostic  wrote:

> I use beta driver SQL ODBC from Databricks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

RE: problem with start-slaves.sh

2014-10-30 Thread Pagliari, Roberto

I also didn’t realize I was trying to bring up the 2ndNameNode as a slave.. 
that might be an issue as well..

Thanks,

From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Thursday, October 30, 2014 11:27 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh

Roberto, I don't think shark is an issue -- I have shark server running on a 
node that also acts as a worker. What you can do is turn off shark server, just 
run start-all to start your spark cluster. then you can try bin/spark-shell 
--master  and see if you can successfully run some "hello world" 
stuff. This will verify you have a working Spark cluster. Shark is just an 
application on top of it, so I can't imagine that's what's causing 
interference. But stopping it is the simplest way to check.

On Wed, Oct 29, 2014 at 10:54 PM, Pagliari, Roberto 
mailto:rpagli...@appcomsci.com>> wrote:
hi Yana,
in my case I did not start any spark worker. However, shark was definitely 
running. Do you think that might be a problem?

I will take a look

Thank you,

From: Yana Kadiyska [yana.kadiy...@gmail.com]
Sent: Wednesday, October 29, 2014 9:45 AM
To: Pagliari, Roberto
Cc: user@spark.apache.org
Subject: Re: problem with start-slaves.sh
I see this when I start a worker and then try to start it again forgetting it's 
already running (I don't use start-slaves, I start the slaves individually with 
start-slave.sh). All this is telling you is that there is already a running 
process on that machine. You can see it if you do a ps -aef|grep worker

you can look on the spark UI and see if your master shows this machine as 
connected to it already. If it doesn't, you might want to kill the worker 
process and restart it.

On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto 
mailto:rpagli...@appcomsci.com>> wrote:
I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I build with PHive 
option to be able to interface with hive)

I’m getting this

ip_address: org.apache.spark.deploy.worker.Worker running as process . Stop 
it first.

Am I doing something wrong? In my specific case, shark+hive is running on the 
nodes. Does that interfere with spark?

Thank you,

Re: problem with start-slaves.sh

2014-10-30 Thread Yana Kadiyska

Roberto, I don't think shark is an issue -- I have shark server running on
a node that also acts as a worker. What you can do is turn off shark
server, just run start-all to start your spark cluster. then you can try
bin/spark-shell --master  and see if you can successfully run
some "hello world" stuff. This will verify you have a working Spark
cluster. Shark is just an application on top of it, so I can't imagine
that's what's causing interference. But stopping it is the simplest way to
check.

On Wed, Oct 29, 2014 at 10:54 PM, Pagliari, Roberto  wrote:

>  hi Yana,
> in my case I did not start any spark worker. However, shark was definitely
> running. Do you think that might be a problem?
>
>  I will take a look
>
>  Thank you,
>
>  --
> *From:* Yana Kadiyska [yana.kadiy...@gmail.com]
> *Sent:* Wednesday, October 29, 2014 9:45 AM
> *To:* Pagliari, Roberto
> *Cc:* user@spark.apache.org
> *Subject:* Re: problem with start-slaves.sh
>
>   I see this when I start a worker and then try to start it again
> forgetting it's already running (I don't use start-slaves, I start the
> slaves individually with start-slave.sh). All this is telling you is that
> there is already a running process on that machine. You can see it if you
> do a ps -aef|grep worker
>
>  you can look on the spark UI and see if your master shows this machine
> as connected to it already. If it doesn't, you might want to kill the
> worker process and restart it.
>
> On Tue, Oct 28, 2014 at 4:32 PM, Pagliari, Roberto <
> rpagli...@appcomsci.com> wrote:
>
>>  I ran sbin/start-master.sh followed by sbin/start-slaves.sh (I build
>> with PHive option to be able to interface with hive)
>>
>>
>>
>> I’m getting this
>>
>>
>>
>> ip_address: org.apache.spark.deploy.worker.Worker running as process
>> . Stop it first.
>>
>>
>>
>> Am I doing something wrong? In my specific case, shark+hive is running on
>> the nodes. Does that interfere with spark?
>>
>>
>>
>> Thank you,
>>
>
>

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell

Whats the error with the 2.10 version of algebird?

On Thu, Oct 30, 2014 at 12:49 AM, thadude  wrote:

> I've tried:
>
> . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar
>
> scala> import com.twitter.algebird._
> import com.twitter.algebird._
>
> scala>  import HyperLogLog._
> import HyperLogLog._
>
> scala> import com.twitter.algebird.HyperLogLogMonoid
> import com.twitter.algebird.HyperLogLogMonoid
>
>
> scala> val hll = new HyperLogLogMonoid(12)
> hll: com.twitter.algebird.HyperLogLogMonoid =
> com.twitter.algebird.HyperLogLogMonoid@7bde289a
>
>
> https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Algebird-using-spark-shell-tp17701p17714.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

issue on applying SVM to 5 million examples.

Hi,



Previous we have applied SVM algorithm in MLlib to 5 million records (600
mb), it takes more than 25 minutes to finish.
The spark version we are using is 1.0 and we were running this program on a
4 nodes cluster. Each node has 4 cpu cores and 11 GB RAM.

The 5 million records only have two distinct records (One positive and one
negative), others are all duplications.

Any one has any idea on why it takes so long on this small data?



Thanks,
Best,

Peng

Using a Database to persist and load data from

2014-10-30 Thread Asaf Lahav

Hi Ladies and Gents,
I would like to know what are the options I have if I would like to
leverage Spark code I already have written to use a DB (Vertica) as its
store/datasource.
The data is of tabular nature. So any relational DB can essentially be used.

Do I need to develop a context? If yes, how? where can I get a good example?


Thank you,
Asaf

Re: Spark + Tableau

2014-10-30 Thread Bojan Kostic

I use beta driver SQL ODBC from Databricks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720p17727.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark + Tableau

2014-10-30 Thread Jimmy

What ODBC driver are you using? We recently got the Hortonworks JODBC drivers 
working on a Windows box but was having issues with Mac



Sent from my iPhone

> On Oct 30, 2014, at 4:23 AM, Bojan Kostic  wrote:
> 
> I'm testing beta driver from Databricks for Tableua.
> And unfortunately i encounter some issues.
> While beeline connection works without problems, Tableau can't connect to
> spark thrift server.
> 
> Error from driver(Tableau):
> Unable to connect to the ODBC Data Source. Check that the necessary drivers
> are installed and that the connection properties are valid.
> [Simba][SparkODBC] (34) Error from Spark: ETIMEDOUT.
> 
> Unable to connect to the server "test.server.com". Check that the server is
> running and that you have access privileges to the requested database.
> Unable to connect to the server. Check that the server is running and that
> you have access privileges to the requested database.
> 
> Exception on Thrift server:
> java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
>at
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
>at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:722)
> Caused by: org.apache.thrift.transport.TTransportException
>at
> org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
>at
> org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
>at
> org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
>at
> org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
>at
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
>at
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>at
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>... 4 more
> 
> Is there anyone else who's testing this driver, or did anyone saw this
> message?
> 
> Best regards
> Bojan Kostić
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Returned type of Broadcast variable is byte array

2014-10-30 Thread Stephen Boesch

As a template for creating a broadcast variable, the following code snippet
within mllib was used:

val bcIdf = dataset.context.broadcast(idf)
dataset.mapPartitions { iter =>
  val thisIdf = bcIdf.value


The new code follows that model:

import org.apache.spark.mllib.linalg.{Vector => MVector}
  ..
assert(crows.isInstanceOf[Array[MVector]])
val bcRows = sc.broadcast(crows)
val GU = mat.rows.zipWithIndex.mapPartitions { case dataIter =>
val arrayVect = bcRows.value // bcRows.value is seen in
debugger to be of type Array[Byte] .. ??

That last line is unhappy:

   java.lang.ClassCastException: [B cannot be cast to
[Lorg.apache.spark.mllib.linalg.Vector;

So the compiler is aware that the return type of the broadcast "value"
method should be an array of vector (which it should). However the actual
type is Array[Byte].   Any insights on this?

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin

The split is something like 30 million into 2 milion partitions. The reason
that it becomes tractable is that after I perform the Cartesian on the
split data and operate on it I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
dataset tractable ( I neglected to mention this in the first email).

The way the code is structured I have forced linear execution until this
point so at the time of execution of the split it is the only thing
happening. In terms of memory I have assigned 23gb of memory and 17gb of
heap.
On Oct 30, 2014 3:32 AM, "Sean Owen"  wrote:

> Can you be more specific about numbers?
> I am not sure that splitting helps so much in the end, in that it has
> the same effect as executing a smaller number at a time of the large
> number of tasks that the full cartesian join would generate.
> The full join is probably intractable no matter what in this case?
> The OOM is not necessarily directly related. It depends on where it
> happened, what else you are doing, how much memory you gave, etc.
>
> On Thu, Oct 30, 2014 at 3:29 AM, Ganelin, Ilya
>  wrote:
> > Hey all – not writing to necessarily get a fix but more to get an
> > understanding of what’s going on internally here.
> >
> > I wish to take a cross-product of two very large RDDs (using cartesian),
> the
> > product of which is well in excess of what can be stored on disk .
> Clearly
> > that is intractable, thus my solution is to do things in batches -
> > essentially I can take the cross product of a small piece of the first
> data
> > set with the entirety of the other. To do this, I calculate how many
> items
> > can fit into 1 gig of memory. Next, I use RDD.random Split() to partition
> > the first data set. The issue is that I am trying to partition an RDD of
> > several million items into several million partitions. This throws the
> > following error:
> >
> > I would like to understand the internals of what’s going on here so that
> I
> > can adjust my approach accordingly. Thanks in advance.
> >
> >
> > 14/10/29 22:17:44 ERROR ActorSystemImpl: Uncaught fatal error from thread
> > [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem
> > [sparkDriver]
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at com.google.protobuf_spark.ByteString.toByteArray(ByteString.java:213)
> > at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
> > at
> >
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
> > at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
> > at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
> > at
> >
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> > at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> > at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> > at
> >
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > at
> >
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > at
> >
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> > exceeded
> > at java.util.Arrays.copyOfRange(Arrays.java:2694)
> > at java.lang.String.(String.java:203)
> > at java.lang.String.substring(String.java:1913)
> > at java.lang.String.subSequence(String.java:1946)
> > at java.util.regex.Matcher.getSubSequence(Matcher.java:1245)
> > at java.util.regex.Matcher.group(Matcher.java:490)
> > at java.util.Formatter$FormatSpecifier.(Formatter.java:2675)
> > at java.util.Formatter.parse(Formatter.java:2528)
> > at java.util.Formatter.format(Formatter.java:2469)
> > at java.util.Formatter.format(Formatter.java:2423)
> > at java.lang.String.format(String.java:2790)
> > at
> scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
> > at scala.collection.immutable.StringOps.format(StringOps.scala:31)
> > at org.apache.spark.util.Utils$.getCallSite(Utils.scala:944)
> > at org.apache.spark.rdd.RDD.(RDD.scala:1227)
> > at org.apache.spark.rdd.RDD.(RDD.scala:83)
> > at
> >
> org.apache.spark.rdd.PartitionwiseSampledRDD.(PartitionwiseSampledRDD.scala:47)
> > at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:378)
> > at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:377)
> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growabl

Re: Getting vector values

Call toArray on the Vector and print that, or toBreeze.
On Oct 30, 2014 12:38 PM, "Andrejs Abele"  wrote:

> Hi,
>
> I'm new to Mllib and spark.  I'm trying to use tf-idf and use those values
> for term ranking.
> I'm getting tf values in vector format, but how can get the values of
> vector?
>
>  val sc = new SparkContext(conf)
>  val documents: RDD[Seq[String]] =
> sc.textFile("/home/andrejs/Datasets/dbpedia/test.txt").map(_.split("
> ").toSeq)
>documents.foreach(println(_))
>val hashingTF = new HashingTF()
>val tf: RDD[Vector] = hashingTF.transform(documents)
>
>tf.foreach(println(_))
>
> My output is :
> WrappedArray(a, a, b, c)
> WrappedArray(e, a, c, d)
>
> (1048576,[97,99,100,101],[1.0,1.0,1.0,1.0])
> (1048576,[97,98,99],[2.0,1.0,1.0])
>
> How can I get  [97,99,100,101] out, and [1.0,1.0,1.0,1.0] ?
> And how can I map that 100 = 1.0  ?
>
> Some help is greatly appreciated,
>
> Andrejs
>

Getting vector values

2014-10-30 Thread Andrejs Abele

Hi,

I'm new to Mllib and spark.  I'm trying to use tf-idf and use those values
for term ranking.
I'm getting tf values in vector format, but how can get the values of
vector?

 val sc = new SparkContext(conf)
 val documents: RDD[Seq[String]] =
sc.textFile("/home/andrejs/Datasets/dbpedia/test.txt").map(_.split("
").toSeq)
   documents.foreach(println(_))
   val hashingTF = new HashingTF()
   val tf: RDD[Vector] = hashingTF.transform(documents)

   tf.foreach(println(_))

My output is :
WrappedArray(a, a, b, c)
WrappedArray(e, a, c, d)

(1048576,[97,99,100,101],[1.0,1.0,1.0,1.0])
(1048576,[97,98,99],[2.0,1.0,1.0])

How can I get  [97,99,100,101] out, and [1.0,1.0,1.0,1.0] ?
And how can I map that 100 = 1.0  ?

Some help is greatly appreciated,

Andrejs

Spark + Tableau

2014-10-30 Thread Bojan Kostic

I'm testing beta driver from Databricks for Tableua.
And unfortunately i encounter some issues.
While beeline connection works without problems, Tableau can't connect to
spark thrift server.

Error from driver(Tableau):
Unable to connect to the ODBC Data Source. Check that the necessary drivers
are installed and that the connection properties are valid.
[Simba][SparkODBC] (34) Error from Spark: ETIMEDOUT.

Unable to connect to the server "test.server.com". Check that the server is
running and that you have access privileges to the requested database.
Unable to connect to the server. Check that the server is running and that
you have access privileges to the requested database.

Exception on Thrift server:
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.thrift.transport.TTransportException
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
at
org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
... 4 more

Is there anyone else who's testing this driver, or did anyone saw this
message?

Best regards
Bojan Kostić

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Tableau-tp17720.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark Debugging

2014-10-30 Thread Akhil Das

Thanks
Best Regards

On Thu, Oct 30, 2014 at 1:43 PM, Akhil Das 
wrote:

> Awesome.
>
> Thanks
> Best Regards
>
> On Thu, Oct 30, 2014 at 1:30 PM, Naveen Kumar Pokala <
> npok...@spcapitaliq.com> wrote:
>
>> Thanks Akhil, It is working.
>>
>>
>>
>> Regards,
>>
>> Naveen
>>
>>
>>
>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> *Sent:* Thursday, October 30, 2014 1:24 PM
>>
>> *To:* Naveen Kumar Pokala
>> *Subject:* Re: Spark Debugging
>>
>>
>>
>> Yes you can, but the values might not give you a complete clue. Give it a
>> try
>>
>>
>> Thanks
>>
>> Best Regards
>>
>>
>>
>> On Thu, Oct 30, 2014 at 1:20 PM, Naveen Kumar Pokala <
>> npok...@spcapitaliq.com> wrote:
>>
>> Do you mean can’t I debug line by line?
>>
>>
>>
>> Thanks and Regards
>>
>> Naveen
>>
>>
>>
>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> *Sent:* Thursday, October 30, 2014 1:19 PM
>>
>>
>> *To:* Naveen Kumar Pokala
>> *Subject:* Re: Spark Debugging
>>
>>
>>
>> Master machine. The execution will get pause in eclipse regardless of
>> where its executing the program, you might not get a complete flow, but you
>> can see whats happening.
>>
>>
>> Thanks
>>
>> Best Regards
>>
>>
>>
>> On Thu, Oct 30, 2014 at 1:10 PM, Naveen Kumar Pokala <
>> npok...@spcapitaliq.com> wrote:
>>
>> Hi Akhil,
>>
>>
>>
>> With below configuration I am successfully able to connect to  Node
>> A(Master node) where I am running the spark submit program, Spark may give
>> driver program to another Node B also.
>>
>>
>>
>> Which machine IP should I use to debug the programs from eclipse.
>>
>>
>>
>> Thanks,
>>
>> Naveen
>>
>>
>>
>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> *Sent:* Thursday, October 30, 2014 12:35 PM
>> *To:* Naveen Kumar Pokala
>> *Subject:* Re: Spark Debugging
>>
>>
>>
>> Hi Naveen,
>>
>>
>>
>> You should be able to connect to  port also from your windows machine.
>> 
>>
>>
>> Thanks
>>
>> Best Regards
>>
>>
>>
>> On Thu, Oct 30, 2014 at 12:32 PM, Naveen Kumar Pokala <
>> npok...@spcapitaliq.com> wrote:
>>
>> Hi Akhil,
>>
>>
>>
>> I have gone through the article. But I am  not starting my java program
>> with Java command, I am submitting through spark-submit and I have added
>> like below is that ok?
>>
>>
>>
>> bin/spark-submit --class sample.spark.test.SparkJob --master yarn-cluster
>> --driver-memory 4g --executor-memory  --conf
>> "spark.executor.extraJavaOptions=-Xdebug
>> -agentlib:jdwp=transport=dt_socket,address=,server=y,suspend=n"
>> /home/npokala/data/test.jar
>>
>>
>>
>> I am able to connect to cluster from my windows machine only through
>> putty. I am able to access the cluster information and spark information
>> from my browser. Is that sufficient.?
>>
>>
>>
>>
>>
>> Regards,
>>
>> Naveen.
>>
>>
>>
>> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
>> *Sent:* Thursday, October 30, 2014 12:10 PM
>> *To:* Naveen Kumar Pokala
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark Debugging
>>
>>
>>
>> Its called remote debugging. You can read this article
>> 
>> for more information. You will have to make sure that the network between
>> your cluster and windows machine can communicate with each other.
>>
>>
>> Thanks
>>
>> Best Regards
>>
>>
>>
>> On Thu, Oct 30, 2014 at 12:06 PM, Naveen Kumar Pokala <
>> npok...@spcapitaliq.com> wrote:
>>
>> Hi,
>>
>>
>>
>>   I have installed 2 node hadoop cluster (For example, on Unix machines A
>> and B. A master node and data node, B is data node)
>>
>>
>>
>>   I am submitting my driver programs through SPARK 1.1.0 with
>> bin/spark-submit from Putty Client from my Windows machine.
>>
>>
>>
>> I want to debug my program from Eclipse on my local machine. I am not
>> able to find a way to debug.
>>
>>
>>
>> Please let me know the ways to debug my driver program as well as
>> executor programs
>>
>>
>>
>>
>>
>> Regards,
>>
>>
>>
>> Naveen.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok

I'm processing some data using PySpark and I'd like to save the RDDs to disk
(they are (k,v) RDDs of strings and SparseVector types) and read them in
using Scala to run them through some other analysis. Is this possible? 

Thanks,

Rok



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sharing-RDDs-between-PySpark-and-Scala-tp17718.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: BUG: when running as "extends App", closures don't capture variables

Very coincidentally I ran into something equally puzzling yesterday where
something was bizarrely null when it can't have been in a Spark program
that extends App. I also changed to use main() and it works fine. So
definitely some issue here. If nobody makes a JIRA before I get home I'll
do it.
On Oct 29, 2014 11:20 PM, "Michael Albert" 
wrote:

> Greetings!
>
> This might be a documentation issue as opposed to a coding issue, in that
> perhaps the correct answer is "don't do that", but as this is not obvious,
> I am writing.
>
> The following code produces output most would not expect:
>
> package misc
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
>
> object DemoBug extends App {
> val conf = new SparkConf()
> val sc = new SparkContext(conf)
>
> val rdd = sc.parallelize(List("A","B","C","D"))
> val str1 = "A"
>
> val rslt1 = rdd.filter(x => { x != "A" }).count
> val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
>
> println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
> }
>
> This produces the output:
> DemoBug: rslt1 = 3 rslt2 = 0
>
> Compiled with sbt:
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
> Run on an EC2 EMR instance with a recent image (hadoop 2.4.0, spark 1.1.0)
>
> If instead there is a proper "main()", it works as expected.
>
> Thank you.
>
> Sincerely,
>  Mike
>

NonSerializable Exception in foreachRDD

2014-10-30 Thread Harold Nguyen

Hi all,

In Spark Streaming, when I do "foreachRDD" on my DStreams, I get a
NonSerializable exception when I try to do something like:

DStream.foreachRDD( rdd => {
  var sc.parallelize(Seq(("test", "blah")))
})

Is there any way around that ?

Thanks,

Harold

Re: Algebird using spark-shell

2014-10-30 Thread thadude

I've tried:

. /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala>  import HyperLogLog._
import HyperLogLog._

scala> import com.twitter.algebird.HyperLogLogMonoid
import com.twitter.algebird.HyperLogLogMonoid


scala> val hll = new HyperLogLogMonoid(12)
hll: com.twitter.algebird.HyperLogLogMonoid =
com.twitter.algebird.HyperLogLogMonoid@7bde289a


https://github.com/twitter/algebird/wiki/Algebird-Examples-with-REPL




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Algebird-using-spark-shell-tp17701p17714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: GC Issues with randomSplit on large dataset

Can you be more specific about numbers?
I am not sure that splitting helps so much in the end, in that it has
the same effect as executing a smaller number at a time of the large
number of tasks that the full cartesian join would generate.
The full join is probably intractable no matter what in this case?
The OOM is not necessarily directly related. It depends on where it
happened, what else you are doing, how much memory you gave, etc.

On Thu, Oct 30, 2014 at 3:29 AM, Ganelin, Ilya
 wrote:
> Hey all – not writing to necessarily get a fix but more to get an
> understanding of what’s going on internally here.
>
> I wish to take a cross-product of two very large RDDs (using cartesian), the
> product of which is well in excess of what can be stored on disk . Clearly
> that is intractable, thus my solution is to do things in batches -
> essentially I can take the cross product of a small piece of the first data
> set with the entirety of the other. To do this, I calculate how many items
> can fit into 1 gig of memory. Next, I use RDD.random Split() to partition
> the first data set. The issue is that I am trying to partition an RDD of
> several million items into several million partitions. This throws the
> following error:
>
> I would like to understand the internals of what’s going on here so that I
> can adjust my approach accordingly. Thanks in advance.
>
>
> 14/10/29 22:17:44 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkDriver-akka.actor.default-dispatcher-16] shutting down ActorSystem
> [sparkDriver]
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> at com.google.protobuf_spark.ByteString.toByteArray(ByteString.java:213)
> at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:24)
> at
> akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
> at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit
> exceeded
> at java.util.Arrays.copyOfRange(Arrays.java:2694)
> at java.lang.String.(String.java:203)
> at java.lang.String.substring(String.java:1913)
> at java.lang.String.subSequence(String.java:1946)
> at java.util.regex.Matcher.getSubSequence(Matcher.java:1245)
> at java.util.regex.Matcher.group(Matcher.java:490)
> at java.util.Formatter$FormatSpecifier.(Formatter.java:2675)
> at java.util.Formatter.parse(Formatter.java:2528)
> at java.util.Formatter.format(Formatter.java:2469)
> at java.util.Formatter.format(Formatter.java:2423)
> at java.lang.String.format(String.java:2790)
> at scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
> at scala.collection.immutable.StringOps.format(StringOps.scala:31)
> at org.apache.spark.util.Utils$.getCallSite(Utils.scala:944)
> at org.apache.spark.rdd.RDD.(RDD.scala:1227)
> at org.apache.spark.rdd.RDD.(RDD.scala:83)
> at
> org.apache.spark.rdd.PartitionwiseSampledRDD.(PartitionwiseSampledRDD.scala:47)
> at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:378)
> at org.apache.spark.rdd.RDD$$anonfun$randomSplit$1.apply(RDD.scala:377)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD.randomSplit(RDD.scala:379)
>
>
>
> 
>
> The information contained in this e-mail is confidential and/or proprietary
> to Capital One and/or its

Re: Java api overhead?