Re: Submitting Spark application through code
I tried running it but it didn't work:

    public static final SparkConf batchConf = new SparkConf();
    String master = "spark://sivarani:7077";
    String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/";
    String jar = "/home/sivarani/build/Test.jar";
    public static final JavaSparkContext batchSparkContext =
        new JavaSparkContext(master, "SparkTest", spark_home, new String[] {jar});

    public static void main(String args[]) {
        runSpark(0, "TestSubmit");
    }

    public static void runSpark(int crit, String dataFile) {
        JavaRDD<String> logData = batchSparkContext.textFile(input, 10);
        // flatMap, mapToPair, reduceByKey ...
        List<Tuple2<String, Integer>> output1 = counts.collect();
    }

This works fine with spark-submit, but when I tried to submit through code with LeadBatchProcessing.runSpark(0, "TestSubmit.csv"); I get the following error:

    HTTP Status 500 - javax.servlet.ServletException: org.apache.spark.SparkException:
    Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure:
    TID 29 on host 172.18.152.36 failed for unknown reason
    Driver stacktrace:

Any advice on this?
Re: Scaladoc
In IntelliJ, Tools > Generate Scaladoc.

Kamal

On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com wrote: How do I build the scaladoc html files from the Spark source distribution?

Alex Baretta
Re: Submitting Spark application through code
What do your worker logs say?

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com wrote: I tried running it but it didn't work:

    public static final SparkConf batchConf = new SparkConf();
    String master = "spark://sivarani:7077";
    String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/";
    String jar = "/home/sivarani/build/Test.jar";
    public static final JavaSparkContext batchSparkContext =
        new JavaSparkContext(master, "SparkTest", spark_home, new String[] {jar});

    public static void main(String args[]) {
        runSpark(0, "TestSubmit");
    }

    public static void runSpark(int crit, String dataFile) {
        JavaRDD<String> logData = batchSparkContext.textFile(input, 10);
        // flatMap, mapToPair, reduceByKey ...
        List<Tuple2<String, Integer>> output1 = counts.collect();
    }

This works fine with spark-submit, but when I tried to submit through code with LeadBatchProcessing.runSpark(0, "TestSubmit.csv"); I get the following error:

    HTTP Status 500 - javax.servlet.ServletException: org.apache.spark.SparkException:
    Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure:
    TID 29 on host 172.18.152.36 failed for unknown reason
    Driver stacktrace:

Any advice on this?
Re: Manipulating RDDs within a DStream
Hi,

Since the Cassandra connection object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation, and then close the connection. For example:

    wordCounts.foreachRDD(rdd => {
      val arr = rdd.toArray
      // open Cassandra connection
      // store arr
      // close Cassandra connection
    })

Thanks
Lalit Yadav
la...@sigmoidanalytics.com
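A slightly fuller sketch of the pattern described above, opening one connection per partition rather than per RDD or per record. This is only an illustration under assumptions: the host, keyspace, and table names are placeholders, the stream is assumed to be a DStream[(String, Int)], and it assumes the DataStax Java driver (2.x era) is on the classpath:

    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition, created on the worker where it is used
        val cluster = com.datastax.driver.core.Cluster.builder()
          .addContactPoint("127.0.0.1").build()
        val session = cluster.connect("my_keyspace")
        records.foreach { case (word, count) =>
          session.execute(
            s"INSERT INTO word_counts (word, count) VALUES ('$word', $count)")
        }
        session.close()
        cluster.close()
      }
    }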
Re: NonSerializable Exception in foreachRDD
Are you expecting something like this?

    val data = ssc.textFileStream("hdfs://akhldz:9000/input/")
    val rdd = ssc.sparkContext.parallelize(Seq("foo", "bar"))
    val sample = data.foreachRDD(x => {
      val new_rdd = x.union(rdd)
      new_rdd.saveAsTextFile("hdfs://akhldz:9000/output/")
    })

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:46 AM, Tobias Pfeiffer t...@preferred.jp wrote: Harold, just mentioning it in case you run into it: if you are in a separate thread, there are apparently stricter limits to what you can and cannot serialize:

    val someVal
    future {
      // be very careful with defining RDD operations using someVal here
      val myLocalVal = someVal
      // use myLocalVal instead
    }

On Thu, Oct 30, 2014 at 4:55 PM, Harold Nguyen har...@nexgate.com wrote: In Spark Streaming, when I do foreachRDD on my DStreams, I get a NonSerializable exception when I try to do something like:

    DStream.foreachRDD( rdd => {
      var sc.parallelize(Seq(("test", "blah")))
    })

Is this the code you are actually using? "var sc.parallelize(...)" doesn't really look like valid Scala to me.

Tobias
Re: Spark Streaming Issue not running 24/7
It says "478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException". Can you try putting a try { ... } catch { ... } block around all those operations that you are doing on the DStream? That way it will not stop the entire application due to corrupt data/fields etc.

Thanks
Best Regards

On Fri, Oct 31, 2014 at 10:09 AM, sivarani whitefeathers...@gmail.com wrote: The problem is simple. I want to stream data 24/7, do some calculations, and save the result in a csv/json file so that I could use it for visualization with dc.js/d3.js. I opted for Spark Streaming on a YARN cluster with Kafka and tried running it 24/7, using groupByKey and updateStateByKey to keep the computed historical data. Initially streaming works fine, but after a few hours I am getting:

    14/10/30 23:48:49 ERROR TaskSetManager: Task 2485162.0:3 failed 4 times; aborting job
    14/10/30 23:48:50 ERROR JobScheduler: Error running job streaming job 141469227 ms.1
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 2485162.0:3 failed 4 times,
    most recent failure: Exception failure in TID 478548 on host 172.18.152.36:
    java.lang.ArrayIndexOutOfBoundsException
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)

I guess it's due to the groupByKey and updateStateByKey; I tried groupByKey(100) and increased partitions. Also, when data is in state — say at the 10th second 1,000 records are in state, and at the 100th second 20,000 records are in state, out of which 19,000 records are no longer being updated — how do I remove them from state? With updateStateByKey returning None: how and when do I do that? How will we know when to send None, and how do we save the data before setting None?

I also tried not sending any data for a few hours, but checking the web UI I see the task FINISHED:

    app-20141030203943-  NewApp  0  6.0 GB  2014/10/30 20:39:43  hadoop  FINISHED  4.2 h

This makes me confused. The code says awaitTermination, but it did not terminate the task. Will streaming stop if no data is received for a significant amount of time? Is there any doc available on how long Spark will run when no data is streamed?
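On the state-eviction question above: updateStateByKey drops a key when the update function returns None, so one common pattern is to timestamp each entry and return None once it has been idle too long, after persisting it elsewhere. A hedged sketch; the (count, lastSeen) state shape, the one-hour threshold, and the input DStream pairs of type DStream[(String, Int)] are all assumptions for illustration:

    def update(newValues: Seq[Int], state: Option[(Int, Long)]): Option[(Int, Long)] = {
      val now = System.currentTimeMillis()
      if (newValues.nonEmpty) {
        // fold the new values into the running count and refresh the timestamp
        Some((state.map(_._1).getOrElse(0) + newValues.sum, now))
      } else state.flatMap { case (count, lastSeen) =>
        if (now - lastSeen > 60 * 60 * 1000) None  // save the record externally first, then evict it
        else Some((count, lastSeen))
      }
    }
    val cumulative = pairs.updateStateByKey[(Int, Long)](update _)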
different behaviour of the same code
I am trying to write some sample code under IntelliJ IDEA. I start with a non-sbt Scala project. In order for the program to compile, I add spark-assembly-1.1.0-hadoop2.4.0.jar from the spark/lib directory as an external library of the IDEA project. http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/proj.jpg The code is here: LogReg.scala http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/LogReg.scala Then I click the Run button of the IDEA and I get the following error message: errlog.txt http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/errlog.txt But when I export the jar file and use spark-submit --class net.yanl.spark.LogReg log_reg.jar 15, the program works fine. This is quite annoying. Can anyone resolve this issue? You may need the following files to reproduce the error: out5_training.log / out5_testing.log http://apache-spark-user-list.1001560.n3.nabble.com/file/n17803/small01.log
Re: Using a Database to persist and load data from
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an OutputFormat class. You will have to write a custom class that extends OutputFormat and implement getRecordWriter, which returns a custom RecordWriter. So you will also have to write a custom RecordWriter, extending RecordWriter, with a write method that actually writes to the DB.

On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote: AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD also has some restrictions, such as requiring the SQL to contain a WHERE clause. For writing to a DB, you can implement it with mapPartitions or foreachPartition. You can refer to this example: http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala

2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com: Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, how? Where can I get a good example?

Thank you,
Asaf
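To illustrate the mapPartitions/foreachPartition route mentioned above, here is a minimal, hedged sketch using plain JDBC: one connection and one batched statement per partition. The connection URL, credentials, and table name are placeholders (Vertica ships a JDBC driver, so the same shape should apply there), and the RDD is assumed to be an RDD[(String, Int)]:

    rdd.foreachPartition { rows =>
      val conn = java.sql.DriverManager.getConnection(
        "jdbc:vertica://host:5433/db", "user", "password")
      val stmt = conn.prepareStatement("INSERT INTO my_table (k, v) VALUES (?, ?)")
      try {
        rows.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setInt(2, v)
          stmt.addBatch()          // batch inserts instead of one round-trip per row
        }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }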
about aggregateByKey and standard deviation
Hi, everyone

I have an RDD filled with data like

    (k1, v11)
    (k1, v12)
    (k1, v13)
    (k2, v21)
    (k2, v22)
    (k2, v23)
    ...

I want to calculate the average and standard deviation of (v11, v12, v13) and (v21, v22, v23), grouped by their keys. For the moment, I have done that using groupByKey and map. I notice that groupByKey is very expensive, but I cannot figure out how to do it using aggregateByKey, so I wonder: is there any better way to do this? Thanks!

qinwei
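One way to avoid groupByKey here is to aggregate (count, sum, sum of squares) per key and derive both statistics from that single pass. A sketch, assuming the input is an RDD[(String, Double)]:

    val stats = rdd.aggregateByKey((0L, 0.0, 0.0))(
      (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),  // fold a value into the partition-local accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))   // merge accumulators across partitions
    val meanAndStdDev = stats.mapValues { case (n, sum, sumSq) =>
      val mean = sum / n
      (mean, math.sqrt(sumSq / n - mean * mean))             // population standard deviation
    }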
Re: Using a Database to persist and load data from
I think you can try to use the Hadoop DBOutputFormat.

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com wrote: You can also use PairRDDFunctions' saveAsNewAPIHadoopFile, which takes an OutputFormat class. You will have to write a custom class that extends OutputFormat and implement getRecordWriter, which returns a custom RecordWriter. So you will also have to write a custom RecordWriter, extending RecordWriter, with a write method that actually writes to the DB.

On Fri, Oct 31, 2014 at 11:25 AM, Yanbo Liang yanboha...@gmail.com wrote: AFAIK, you can read data from a DB with JdbcRDD, but there is no interface for writing to a DB. JdbcRDD also has some restrictions, such as requiring the SQL to contain a WHERE clause. For writing to a DB, you can implement it with mapPartitions or foreachPartition. You can refer to this example: http://stackoverflow.com/questions/24916852/how-can-i-connect-to-a-postgresql-database-into-apache-spark-using-scala

2014-10-30 23:01 GMT+08:00 Asaf Lahav asaf.la...@gmail.com: Hi Ladies and Gents, I would like to know what options I have if I would like to leverage Spark code I have already written to use a DB (Vertica) as its store/datasource. The data is of a tabular nature, so any relational DB can essentially be used. Do I need to develop a context? If yes, how? Where can I get a good example?

Thank you,
Asaf
Re: SparkContext UI
No, empty parens do not matter when calling no-arg methods in Scala. This invocation should work as-is and should result in the RDD showing in Storage; I see that when I run it right now. Since it really does/should work, I'd look at other possibilities -- is it maybe taking a short time to start caching? Looking at a different/old Storage tab?

On Fri, Oct 31, 2014 at 1:17 AM, Sameer Farooqui same...@databricks.com wrote: Hi Stuart, You're close! Just add a () after the cache, like:

    data.cache()

...and then run the .count() action on it and you should be good to see it in the Storage UI!

- Sameer

On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Sorry, too quick to pull the trigger on my original email. I should have added that I tried using persist() and cache() but no joy. I'm doing this:

    data = sc.textFile("somedata")
    data.cache
    data.count()

but I still can't see anything in the storage?

On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com wrote: Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is purely on disk, then Spark will read it from disk, compute against it, and then write the results back to disk or show you the results at the scala/python shells. But when you run Spark workloads against purely on-disk files, the RDD won't show up in Spark's Storage UI. Hope that makes sense...

- Sameer

On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Hi All, When I load an RDD with:

    data = sc.textFile("somefile")

I don't see the resulting RDD in the SparkContext gui on localhost:4040 in /storage. Is there something special I need to do to allow me to view this? I tried both the scala and python shells but got the same result.

Thanks
Stuart
Re: Doing RDD.count in parallel, or at least parallelize it as much as possible?
cache() won't speed up a single operation on an RDD, since it is computed the same way before it is persisted.

On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote: By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it, as that could make a big speed improvement.

On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com wrote: Hi Shahab, Are you running Spark in Local, Standalone, YARN or Mesos mode? If you're running in Standalone/YARN/Mesos, then the .count() action is indeed automatically parallelized across multiple Executors. When you run a .count() on an RDD, it actually distributes tasks to different executors to each do a local count on a local partition, and then all the tasks send their sub-counts back to the driver for final aggregation. This sounds like the kind of behavior you're looking for. However, in Local mode, everything runs in a single JVM (the driver + executor), so there's no parallelization across Executors.

On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote: Hi, I noticed that the count (of an RDD) in many of my queries is the most time-consuming part, as it runs in the driver process rather than being done by parallel worker nodes. Is there any way to perform count in parallel, or at least parallelize it as much as possible?

best,
/Shahab
ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found
Hi, all

My job failed and there are a lot of "ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found" messages in the log. Can anyone tell me what's wrong and how to fix it?

Best Regards,
Kevin.
Spark SQL on Cassandra
I am dealing with a Lambda Architecture. This means I have Hadoop on the batch layer, Storm on the speed layer, and I'm storing the precomputed views from both layers in Cassandra. I understand that Spark is a substitute for Hadoop, but at the moment I would like not to change the batch layer. I would like to execute SQL queries on Cassandra using Spark SQL. Is it possible to get just Spark SQL to run on top of Cassandra, without Spark? My goal is to access Cassandra data with BI tools. Spark SQL looks like the perfect tool for this.
Repartitioning by partition size, not by number of partitions.
Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
Re: how idf is calculated
I found my problem. I assumed, based on TF-IDF on Wikipedia, that log base 10 is used, but as I found in this discussion https://groups.google.com/forum/#!topic/scala-language/K5tbYSYqQc8, in Scala it is actually ln (the natural logarithm).

Regards,
Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab as...@live.com wrote: Hi Andrejs, The calculations are a bit different to what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/ Their calculation of IDF is as follows:

    IDFi = log2(N / ni)

where N is the number of documents and ni is the number of documents in which the word appears. This looks different to your IDF function. For TF, they use:

    TFij = fij / maxk fkj

That is: for document j, the term frequency of term i in j is the number of times i appears in j, divided by the maximum number of times any term appears in j (stop words are usually excluded when considering the maximum). So, in your case:

    TFa1 = 2 / 2 = 1
    TFb1 = 1 / 2 = 0.5
    TFc1 = 1 / 2 = 0.5
    TFm1 = 2 / 2 = 1
    ...
    IDFa = log2(3 / 2) = 0.585

So TFa1 * IDFa = 0.585. Wikipedia mentions an adjustment to overcome biases for long documents, by calculating TFij = 0.5 + (0.5 * fij) / maxk fkj, but that doesn't change anything for TFa1, as the value remains 1. In other words, my calculations don't agree with yours, and neither seem to agree with Spark :)

Regards,
Ashic.

--
Date: Thu, 30 Oct 2014 22:13:49 +0000
Subject: how idf is calculated
From: andr...@sindicetech.com
To: u...@spark.incubator.apache.org

Hi,

I'm writing a paper and I need to calculate tf-idf. With your help I managed to get the results I needed, but the problem is that I need to be able to explain how each number was gotten. So I tried to understand how idf was calculated, and the numbers I get don't correspond to those I should get. I have 3 documents (each line a document):

    a a b c m m
    e a c d e e
    d j k l m m c

When I calculate tf, I get this:

    (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
    (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
    (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])

idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where m is the number of documents (3 in my case) and d(t) is the number of documents the term is present in:

    a: log(4/3) = 0.1249387366
    b: log(4/2) = 0.3010299957
    c: log(4/4) = 0
    d: log(4/3) = 0.1249387366
    e: log(4/2) = 0.3010299957
    l: log(4/2) = 0.3010299957
    m: log(4/3) = 0.1249387366

When I output the idf vector with

    idf.idf.toArray.filter(_ > 0).distinct.foreach(println(_))

I get:

    1.3862943611198906
    0.28768207245178085
    0.6931471805599453

I understand why there are only 3 numbers, because only 3 are unique: log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf were calculated.

Best regards,
Andrejs
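A quick check confirms the natural-log reading: in Scala, math.log is ln, and it reproduces the printed values exactly. (The third value, 1.3862..., is ln(4/1), i.e. a document frequency of zero, presumably the hash buckets in which no term ever appeared.)

    math.log(4.0 / 2)  // term in 1 of 3 docs: 0.6931471805599453
    math.log(4.0 / 3)  // term in 2 of 3 docs: 0.28768207245178085
    math.log(4.0 / 1)  // term in 0 docs:      1.3862943611198906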
Re: how idf is calculated
Yes, here the base doesn't matter, as it just multiplies all results by a constant factor. Math libraries tend to have ln, not log10 or log2. ln is often the more, er, natural base for several computations. So I would assume that log = ln in the context of ML.

On Fri, Oct 31, 2014 at 11:31 AM, Andrejs Abele andr...@sindicetech.com wrote: I found my problem. I assumed, based on TF-IDF on Wikipedia, that log base 10 is used, but as I found in this discussion, in Scala it is actually ln (the natural logarithm).
Re: Executor and BlockManager memory size
Hi,

I meet the same problem in the context of Spark and YARN. When I open pyspark with the following command:

    spark/bin/pyspark --master yarn-client --num-executors 1 --executor-memory 2500m

it turns out:

    INFO storage.BlockManagerMasterActor: Registering block manager
    ip-10-0-6-171.us-west-2.compute.internal:38770 with 1294.1 MB RAM

So, according to the documentation, just 2156.83m is allocated to the executor. Moreover, according to YARN, 3072m of memory is used for this container. Do you have any ideas about this?

Thanks a lot
Cheers
Gen

Boromir Widas wrote: Hey Larry, I have been trying to figure this out for standalone clusters as well. http://apache-spark-user-list.1001560.n3.nabble.com/What-is-a-Block-Manager-td12833.html has an answer as to what the block manager is for. From the documentation, what I understood was that if you assign X GB to each executor, spark.storage.memoryFraction (default 0.6) * X is assigned to the BlockManager and the rest to the JVM itself(?). However, as you see, 26.8G is assigned to the BM, and assuming 0.6 memoryFraction, this means the executor sees ~44.7G of memory; I am not sure what happens to the difference (5.3G).

On Thu, Oct 9, 2014 at 9:40 PM, Larry Xiao <xiaodi@...edu> wrote: Hi all, I'm confused about Executor and BlockManager: why do they have different memory?

    14/10/10 08:50:02 INFO AppClient$ClientActor: Executor added: app-20141010085001-/2 on
    worker-20141010004933-brick6-35657 (brick6:35657) with 6 cores
    14/10/10 08:50:02 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141010085001-/2
    on hostPort brick6:35657 with 6 cores, 50.0 GB RAM
    14/10/10 08:50:07 INFO BlockManagerMasterActor: Registering block manager brick6:53296 with 26.8 GB RAM

and on the WebUI:

    Executor ID  Address       RDD Blocks  Memory Used       Disk Used  Active  Failed  Complete  Total  Task Time  Input  Shuffle Read  Shuffle Write
    0            brick3:37607  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    1            brick1:59493  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    2            brick6:53296  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    3            brick5:38543  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    4            brick2:44937  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    5            brick4:46798  0           0.0 B / 26.8 GB   0.0 B      6       0       0         6      0 ms       0.0 B  0.0 B         0.0 B
    driver       brick0:57692  0           0.0 B / 274.6 MB  0.0 B      0       0       0         0      0 ms       0.0 B  0.0 B         0.0 B

As I understand it, a worker consists of a daemon and an executor, and the executor takes charge of both execution and storage. So does it mean that 26.8 GB is reserved for storage and the rest is for execution? Another question: throughout execution, it seems that the BlockManager is always almost free.

    14/10/05 14:33:44 INFO BlockManagerInfo: Added broadcast_21_piece0 in memory on brick2:57501
    (size: 1669.0 B, free: 21.2 GB)

I don't know what I'm missing here.

Best regards,
Larry
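For what it's worth, the 1294.1 MB figure is consistent with how Spark 1.x sizes the BlockManager from the JVM heap. A back-of-envelope check, assuming the defaults spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9:

    // Runtime.getRuntime.maxMemory for -Xmx2500m is a bit below 2500 MB
    // (roughly 2396 MB here, since one survivor space is excluded)
    val maxMemoryMb = 2396.5
    val blockManagerMb = maxMemoryMb * 0.6 * 0.9  // memoryFraction * safetyFraction
    // blockManagerMb is about 1294.1, matching the registration log line above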
Re: sbt/sbt compile error [FATAL]
Hi,

Thanks. Branch 1.1 did not work, but 1.0 worked. Why could that be?

Regards
Hans-Peter
SQL COUNT DISTINCT
While testing Spark SQL I noticed that COUNT DISTINCT works really slowly. The map-partitions phase finishes fast, but the collect phase is slow: it only runs on a single executor. Should it run this way? Here is the simple code I use for testing:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
    parquetFile.registerTempTable("parquetFile")
    val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
    count.map(t => t(0)).collect().foreach(println)

I guess it's because the distinct processing must happen on a single node. But I wonder: can I add some parallelism to the collect process?
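One workaround that may be worth trying (a hedged sketch, not a guaranteed fix) is to rewrite the query so that the DISTINCT runs as its own distributed step before the final count:

    val count = sqlContext.sql(
      "SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) t")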
RE: Repartitioning by partition size, not by number of partitions.
Hi Jan. I've actually written a function recently to do precisely that, using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many of each can fit in each RDD, to populate the input to randomSplit. Unfortunately, in my case I wind up with GC errors on large data doing this and am still debugging :)

-----Original Message-----
From: jan.zi...@centrum.cz [jan.zi...@centrum.cz]
Sent: Friday, October 31, 2014 06:27 AM Eastern Standard Time
To: user@spark.apache.org
Subject: Repartitioning by partition size, not by number of partitions.

Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
unsubscribe
Apology for having to send to all. I am highly interested in Spark and would like to stay on this mailing list, but the email address I signed up with is not the right one. The link below to unsubscribe seems not to be working: https://spark.apache.org/community.html Can anyone help?
Re: unsubscribe
Hongbin,

Please send an email to user-unsubscr...@spark.apache.org in order to unsubscribe from the user list.

On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote: Apology for having to send to all. I am highly interested in Spark and would like to stay on this mailing list, but the email address I signed up with is not the right one. The link below to unsubscribe seems not to be working: https://spark.apache.org/community.html Can anyone help?
Too many files open with Spark 1.1 and CDH 5.1
Hi,

I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: CANNOT FIND ADDRESS
Thanks for the pointers! I did try them, but they didn't seem to help... In my latest try, I am doing spark-submit local, but I still see the same message in the Spark App UI (4040):

    localhost CANNOT FIND ADDRESS

In the logs, I see a lot of in-memory map spilling to disk. I don't understand why that is the case; there should be over 35 GB of RAM available for input that is not significantly large. It may be linked to the performance issues that I am seeing (I have another post seeking advice on that). It seems I am not able to tune Spark sufficiently to execute my process successfully.

    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 1777 MB to disk (2 times so far)
    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 393 MB to disk (1 time so far)
    14/10/31 13:45:12 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (2 times so far)
    14/10/31 13:45:14 INFO ExternalAppendOnlyMap: Thread 230 spilling in-memory map of 554 MB to disk (2 times so far)
    14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 235 spilling in-memory map of 3990 MB to disk (1 time so far)
    14/10/31 13:45:15 INFO ExternalAppendOnlyMap: Thread 236 spilling in-memory map of 2667 MB to disk (1 time so far)
    14/10/31 13:45:17 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 825 MB to disk (3 times so far)
    14/10/31 13:45:24 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 4618 MB to disk (2 times so far)
    14/10/31 13:45:26 INFO ExternalAppendOnlyMap: Thread 233 spilling in-memory map of 15869 MB to disk (1 time so far)
    14/10/31 13:45:37 INFO ExternalAppendOnlyMap: Thread 206 spilling in-memory map of 3026 MB to disk (3 times so far)
    14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 228 spilling in-memory map of 401 MB to disk (3 times so far)
    14/10/31 13:45:48 INFO ExternalAppendOnlyMap: Thread 259 spilling in-memory map of 392 MB to disk (4 times so far)

My Spark properties are:

    spark.akka.frameSize                    50
    spark.akka.timeout                      900
    spark.app.name                          Simple File Merge Application
    spark.core.connection.ack.wait.timeout  900
    spark.default.parallelism               10
    spark.driver.host                       spark-single.c.fi-mdd-poc.internal
    spark.driver.memory                     35g
    spark.driver.port                       40255
    spark.eventLog.enabled                  true
    spark.fileserver.uri                    http://10.240.106.135:59255
    spark.jars                              file:/home/ami_khandeshi_gmail_com/SparkExample-1.0.jar
    spark.master                            local[16]
    spark.scheduler.mode                    FIFO
    spark.shuffle.consolidateFiles          true
    spark.storage.memoryFraction            0.3
    spark.tachyonStore.baseDir              /tmp
    spark.tachyonStore.folderName           spark-21ad0fd2-2177-48ce-9242-8dbb33f2a1f1
    spark.tachyonStore.url                  tachyon://mdd:19998
Re: Too many files open with Spark 1.1 and CDH 5.1
It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
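For raising the limit on the workers, one hedged example (the user name and limit are placeholders) is to set the nofile limit for the account running the Spark daemons in /etc/security/limits.conf and restart the workers:

    # /etc/security/limits.conf -- assumes the workers run as user "spark"
    spark  soft  nofile  65535
    spark  hard  nofile  65535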
SparkContext.stop() ?
what is it for? when do we call it? thanks!
Re: Too many files open with Spark 1.1 and CDH 5.1
Hi Sean,

Thanks for the reply. I think both the driver and the workers have the problem. You are right that the ulimit fixed the driver-side "too many files open" error. And there is a very big shuffle. My maybe naive thought is to migrate the HQL scripts directly from Hive to Spark SQL and make them work. It seems that it won't be that easy; is that correct? It also seems that I had done that with Shark and it worked pretty well in the old days. Any suggestions if we are planning to migrate a large code base from Hive to Spark SQL with minimum code rewriting?

Many thanks.
Cao

On Friday, October 31, 2014, Sean Owen so...@cloudera.com wrote: It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: does updateStateByKey accept a state that is a tuple?
Based on execution on small test cases, it appears that the construction below does what I intend. (Yes, all those Tuple1()s were superfluous.)

    var lines = ssc.textFileStream(dirArg)
    var linesArray = lines.map(line => line.split("\t"))
    var newState = linesArray.map(lineArray =>
      (lineArray(4),
       (1,
        Time((lineArray(0).toDouble * 1000).toLong),
        Time((lineArray(0).toDouble * 1000).toLong))))

    val updateDhcpState = (newValues: Seq[(Int, Time, Time)], state: Option[(Int, Time, Time)]) =>
      Option[(Int, Time, Time)] {
        val newCount   = newValues.map(x => x._1).sum
        val newMinTime = newValues.map(x => x._2).min
        val newMaxTime = newValues.map(x => x._3).max
        val (count, minTime, maxTime) = state.getOrElse((0, Time(Int.MaxValue), Time(Int.MinValue)))
        (count + newCount, Seq(minTime, newMinTime).min, Seq(maxTime, newMaxTime).max)
      }

    var DhcpSvrCum = newState.updateStateByKey[(Int, Time, Time)](updateDhcpState)
Re: Out of memory with Spark Streaming
Thanks Chris for looking at this. I was putting data at roughly the same 50 records per batch max. This issue was purely because of a bug in my persistence logic that was leaking memory. Overall, I haven't seen a lot of lag with the Kinesis + Spark setup, and I am able to process records at roughly the same rate as data is fed into Kinesis, with acceptable latency.

Thanks,
Aniket

On Oct 31, 2014 1:15 AM, Chris Fregly ch...@fregly.com wrote: Curious about why you're only seeing 50 records max per batch. How many receivers are you running? What is the rate at which you're putting data onto the stream? Per the default AWS Kinesis configuration, the producer can do 1000 PUTs per second with max 50k bytes per PUT and max 1mb per second per shard. On the consumer side, you can only do 5 GETs per second and 2mb per second per shard. My hunch is that the 5 GETs per second is what's limiting your consumption rate. Can you verify that these numbers match what you're seeing? If so, you may want to increase your shards and therefore the number of Kinesis receivers. Otherwise, this may require some further investigation on my part. I wanna stay on top of this if it's an issue. Thanks for posting this, Aniket!

-chris

On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all, Sorry, but this was totally my mistake. In my persistence logic, I was creating an async http client instance in RDD foreach but was never closing it, leading to memory leaks. Apologies for wasting everyone's time.

Thanks,
Aniket

On 12 September 2014 02:20, Tathagata Das tathagata.das1...@gmail.com wrote: Which version of Spark are you running? If you are running the latest one, could you try running not a window but a simple event count on every 2-second batch, and see if you are still running out of memory?

TD

On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I did change it to 1 GB. It still ran out of memory, but a little later. The streaming job isn't handling a lot of data. In every 2 seconds, it doesn't get more than 50 records, and each record's size is not more than 500 bytes.

On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512mb).

On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists it to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps, but I think Spark isn't releasing memory after processing is complete. I have even tried changing the storage level to disk-only. Help!

Thanks,
Aniket
RE: Repartitioning by partition size, not by number of partitions.
Hi Ilya,

This seems to me quite a complicated solution. I'm thinking that an easier (though not optimal) approach might be, for example, to heuristically use something like RDD.coalesce(RDD.getNumPartitions() / N), but it keeps me wondering that Spark does not have something like RDD.coalesce(partition_size).

__________

Hi Jan. I've actually written a function recently to do precisely that, using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many of each can fit in each RDD, to populate the input to randomSplit. Unfortunately, in my case I wind up with GC errors on large data doing this and am still debugging :)

-----Original Message-----
From: jan.zi...@centrum.cz [jan.zi...@centrum.cz]
Sent: Friday, October 31, 2014 06:27 AM Eastern Standard Time
To: user@spark.apache.org
Subject: Repartitioning by partition size, not by number of partitions.

Hi,

I have input data consisting of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do:

    sc.textFile(files).coalesce(100)

The problem is that I have to guess the number of partitions in such a way that it's as fast as possible while I still stay on the safe side with RAM memory, so this is quite difficult. For this reason I would like to ask if there is some way to replace coalesce(100) with something that creates N partitions of a given size? I went through the documentation, but I was not able to find a way to do that.

Thank you in advance for any help or advice.
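As far as I know there is no built-in coalesce(partition_size), but the heuristic can be made concrete. A hedged sketch, in Scala, assuming the inputs are locally listable file paths and a target partition size picked purely for illustration:

    val targetPartitionBytes = 128L * 1024 * 1024  // assumed target size per partition
    val totalBytes = fileList.map(f => new java.io.File(f).length).sum  // fileList: Seq[String] of local paths
    val numPartitions = math.max(1, (totalBytes / targetPartitionBytes).toInt)
    val rdd = sc.textFile(fileList.mkString(",")).coalesce(numPartitions)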
Re: How to set Spark to perform only one map at once at each cluster node
Yes, I would expect, as you say, that setting executor-cores to 1 would work, but it seems to me that when I use executor-cores=1 it actually performs more than one task on each of the machines at any one moment (at least based on what top says).

__________

CC: user@spark.apache.org

It's not very difficult to achieve by properly setting the application's parameters. Some basic knowledge you should know: an application can have only one executor on each machine or container (YARN). So if you just set executor-cores to 1, each executor will run only one task at once.

2014-10-28 19:00 GMT+08:00 jan.zi...@centrum.cz jan.zi...@centrum.cz: But I guess that this makes only one task over all the cluster's nodes. I would like to run several tasks, but I would like Spark not to run more than one map on each of my nodes at one time. That means I would like to, let's say, have 4 different tasks and 2 nodes where each node has 2 cores. Currently Hadoop runs 2 maps in parallel on each node (all 4 tasks in parallel), but I would like to somehow force it to run only 1 task on each node and to give it another task after the first task finishes.

__________

The number of tasks is decided by the number of input partitions. If you want only one map or flatMap at once, just call coalesce() or repartition() to associate the data into one partition. However, this is not recommended, because it does not execute efficiently in parallel.

2014-10-28 17:27 GMT+08:00 jan.zi...@centrum.cz jan.zi...@centrum.cz: Hi, I am currently struggling with how to properly set Spark to perform only one map, flatMap, etc. at once. In other words, my map uses a multi-core algorithm, so I would like to have only one map running at a time to be able to use all the machine's cores.

Thank you in advance for advice and replies.

Jan
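One related knob that may help here, sketched with assumed values: if spark.task.cpus is set equal to the number of cores each executor owns, every task reserves the whole executor, so at most one task runs per executor at a time. Whether this fits depends on the cluster manager and version:

    // a hedged sketch, assuming 2 cores per executor
    val conf = new SparkConf()
      .set("spark.executor.cores", "2")
      .set("spark.task.cpus", "2")  // each task claims 2 CPUs => one concurrent task per executor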
Re: spark streaming - saving kafka DStream into hadoop throws exception
Hm, now I am also seeing this problem. The essence of my code is:

    final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() {
      @Override
      public JavaStreamingContext create() {
        return new JavaStreamingContext(sparkContext, new Duration(batchDurationMS));
      }
    };
    streamingContext = JavaStreamingContext.getOrCreate(
        checkpointDirString, sparkContext.hadoopConfiguration(), streamingContextFactory, false);
    streamingContext.checkpoint(checkpointDirString);

which yields:

    2014-10-31 14:29:00,211 ERROR OneForOneStrategy:66 org.apache.hadoop.conf.Configuration
    - field (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9,
      name: conf$2, type: class org.apache.hadoop.conf.Configuration)
    - object (class org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9, <function2>)
    - field (class org.apache.spark.streaming.dstream.ForEachDStream,
      name: org$apache$spark$streaming$dstream$ForEachDStream$$foreachFunc, type: interface scala.Function2)
    - object (class org.apache.spark.streaming.dstream.ForEachDStream,
      org.apache.spark.streaming.dstream.ForEachDStream@cb8016a)
    ...

This looks like it's due to PairDStreamFunctions, as this saveFunc seems to be org.apache.spark.streaming.dstream.PairDStreamFunctions$$anonfun$9:

    def saveAsNewAPIHadoopFiles(
        prefix: String,
        suffix: String,
        keyClass: Class[_],
        valueClass: Class[_],
        outputFormatClass: Class[_ <: NewOutputFormat[_, _]],
        conf: Configuration = new Configuration) {
      val saveFunc = (rdd: RDD[(K, V)], time: Time) => {
        val file = rddToFileName(prefix, suffix, time)
        rdd.saveAsNewAPIHadoopFile(file, keyClass, valueClass, outputFormatClass, conf)
      }
      self.foreachRDD(saveFunc)
    }

conf is indeed serialized to make it part of saveFunc, no? But it can't be serialized. But surely this doesn't fail all the time, or someone would have noticed by now... It could be a particular closure problem again. Any ideas on whether this is a problem, or if there's a workaround? Checkpointing does not work at all for me as a result.

On Fri, Aug 15, 2014 at 10:37 PM, salemi alireza.sal...@udo.edu wrote: Hi All, I am just trying to save the Kafka dstream to Hadoop as follows:

    val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
    dStream.saveAsHadoopFiles(hdfsDataUrl, "data")

It throws the following exception. What am I doing wrong?
    14/08/15 14:30:09 ERROR OneForOneStrategy: org.apache.hadoop.mapred.JobConf
    java.io.NotSerializableException: org.apache.hadoop.mapred.JobConf
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
        at java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
        at org.apache.spark.streaming.DStreamGraph.writeObject(DStreamGraph.scala:168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
A Spark Design Problem
The original problem is in biology, but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks, and a key that fits a lock will have a high score. There may be many keys fitting one lock, and a key may fit no locks well. The objective is to find the best-fitting lock for each key.

Assume that the number of keys and locks is high enough that taking the cartesian product of the two is computationally impractical. Also assume that keys and locks have an attached location which is accurate within an error (say 1 km); only keys and locks within 1 km of each other need be compared.

Now assume I can create a JavaRDD<Key> and a JavaRDD<Lock>. I could divide the locations into 1 km square bins and look only within a few bins. Assume that it is practical to take a cartesian product for all elements in a bin, but not to keep all elements in memory. I could map my RDDs into PairRDDs where the key is the bin assigned by location. I know how to take the cartesian product of two JavaRDDs, but not how to take a cartesian product of sets of elements sharing a common key (bin). Any suggestions? Assume that in the worst case the number of elements in a bin is too large to keep in memory, although if a bin were subdivided into, say, 100 sub-bins, the elements would fit in memory. Any thoughts on how to attack the problem?
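In case it helps, a rough sketch of the bin-join idea (in Scala for brevity; Key, Lock, binOf, score, and the id field stand in for the real types and functions): key both datasets by bin, join within bins so only co-located pairs are scored, and reduce to the best lock per key. To compare across neighbouring bins, each key could be emitted into the surrounding bins with flatMap instead of map.

    val keysByBin  = keys.map(k => (binOf(k.location), k))    // or flatMap into the ~9 neighbouring bins
    val locksByBin = locks.map(l => (binOf(l.location), l))
    val bestLockPerKey = keysByBin.join(locksByBin)           // all (key, lock) pairs sharing a bin
      .map { case (_, (k, l)) => (k.id, (l, score(k, l))) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)      // keep the highest-scoring lock per key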
Re: Too many files open with Spark 1.1 and CDH 5.1
As Sean suggested, try out the new sort-based shuffle in 1.1 if you know you're triggering large shuffles. That should help a lot.

On Friday, October 31, 2014, Bill Q bill.q@gmail.com wrote: Hi Sean, Thanks for the reply. I think both the driver and the workers have the problem. You are right that the ulimit fixed the driver-side "too many files open" error. And there is a very big shuffle. My maybe naive thought is to migrate the HQL scripts directly from Hive to Spark SQL and make them work. It seems that it won't be that easy; is that correct? It also seems that I had done that with Shark and it worked pretty well in the old days. Any suggestions if we are planning to migrate a large code base from Hive to Spark SQL with minimum code rewriting?

Many thanks.
Cao

On Friday, October 31, 2014, Sean Owen so...@cloudera.com wrote: It's almost surely the workers, not the driver (shell), that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard.

On Fri, Oct 31, 2014 at 3:25 PM, Bill Q bill.q@gmail.com wrote: Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem I have encountered is the "Too many files open" error. Once that happens, the query just fails. I started the spark-shell using "ulimit -n 4096 spark-shell", and it still pops the same error. Any solutions?

Many thanks.
Bill
Re: Measuring Performance in Spark
Are there any tools like Ganglia that I can use to get performance metrics on Spark, or do I need to do it myself? Thanks!
Re: Measuring Performance in Spark
Hi Mahsa, Use SPM http://sematext.com/spm/. See http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ . Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Fri, Oct 31, 2014 at 1:00 PM, mahsa mahsa.han...@gmail.com wrote: Is there any tools like Ganglia that I can use to get performance on Spark or I need to do it myself? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Performance-in-Spark-tp17376p17836.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
properties file on a spark cluster
Hi Guys,

I'm trying to execute a Spark job using Python, running on a YARN cluster (managed by Cloudera Manager). The Python script uses a set of Python programs installed on each member of the cluster. This set of programs needs a property file found at a local filesystem path. My problem is: when this script is sent using spark-submit, the programs can't find the properties file. Running locally as a stand-alone job, there is no problem: the properties file is found.

My questions are:

1 - What is the problem here?
2 - In this scenario (a script running on a Spark YARN cluster that uses Python programs sharing the same properties file), what is the best approach?

Thank's
taka
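One approach that may help here (hedged; the paths and file names are placeholders): ship the properties file with the job via spark-submit's --files option, which copies it into each executor's working directory on the cluster, so the programs can resolve it by its bare file name:

    spark-submit --master yarn-client --files /local/path/app.properties my_script.py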
Re: Measuring Performance in Spark
Oh this is Awesome! Exactly what I needed! Thank you Otis!
Re: SparkContext.stop() ?
It is used to shut down the context when you're done with it, but if you're using a context for the lifetime of your application, I don't think it matters. I use this in my unit tests, because they start up local contexts, and you can't have multiple local contexts open, so each test must stop its context when it's done.

On Fri, Oct 31, 2014 at 11:12 AM, ll duy.huynh@gmail.com wrote: what is it for? when do we call it? thanks!

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io
Re: SQL COUNT DISTINCT
The only thing in your code that cannot be parallelized is the collect(), because -- by definition -- it collects all the results to the driver node. This has nothing to do with the DISTINCT in your query. What do you want to do with the results after you collect them? How many results do you have in the output of collect()? Perhaps it makes more sense to continue operating on the RDDs you have, or to save them using one of the RDD methods, because that preserves the cluster's ability to parallelize work.

Nick

On Friday, October 31, 2014, Bojan Kostic blood9ra...@gmail.com wrote: While testing Spark SQL I noticed that COUNT DISTINCT works really slowly. The map-partitions phase finishes fast, but the collect phase is slow: it only runs on a single executor. Should it run this way? Here is the simple code I use for testing:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
    parquetFile.registerTempTable("parquetFile")
    val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
    count.map(t => t(0)).collect().foreach(println)

I guess it's because the distinct processing must happen on a single node. But I wonder: can I add some parallelism to the collect process?
Accessing Cassandra with SparkSQL, Does not work?
Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.catalyst.expressions._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra.CassandraSQLContext

val conf = new SparkConf().setAppName("SomethingElse")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
val sc: SparkContext = new SparkContext(conf)

val rdd = sc.cassandraTable("mydb", "mytable") // this works

But:

val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mydb")
val srdd: SchemaRDD = cc.sql("select * from mydb.mytable")
println("count : " + srdd.count) // does not work

Exception is thrown:

Exception in thread "main" com.google.common.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: key not found: mydb3.inverseeventtype
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201)
at com.google.common.cache.LocalCache.get(LocalCache.java:3934)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938)

In fact mydb3 is another keyspace which I did not even try to connect to! Any idea? best, /Shahab

Here is what my SBT looks like:

libraryDependencies ++= Seq(
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
  "org.apache.cassandra" % "cassandra-all" % "2.0.9" intransitive(),
  "org.apache.cassandra" % "cassandra-thrift" % "2.0.9" intransitive(),
  "net.jpountz.lz4" % "lz4" % "1.2.0",
  "org.apache.thrift" % "libthrift" % "0.9.1" exclude("org.slf4j", "slf4j-api") exclude("javax.servlet", "servlet-api"),
  "com.datastax.cassandra" % "cassandra-driver-core" % "2.0.4" intransitive(),
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
  "org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
  "com.github.nscala-time" %% "nscala-time" % "1.0.0",
  "org.scalatest" %% "scalatest" % "1.9.1" % "test",
  "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
  "org.json4s" %% "json4s-jackson" % "3.2.5",
  "junit" % "junit" % "4.8.1" % "test",
  "org.slf4j" % "slf4j-api" % "1.7.7",
  "org.slf4j" % "slf4j-simple" % "1.7.7",
  "org.clapper" %% "grizzled-slf4j" % "1.0.2",
  "log4j" % "log4j" % "1.2.17")
Re: Manipulating RDDs within a DStream
Hi Harold, Can you include the versions of spark and spark-cassandra-connector you are using? Thanks! Helena @helenaedelson On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an external source like Cassandra, but I keep getting Serialization errors and am not sure how to use the correct design pattern. I was wondering if you could help me. I'd like to be able to do the following: wordCounts.foreachRDD( rdd => { val arr = record.toArray ... }) I would like to use the arr to send back to Cassandra, for instance, like this: val collection = sc.parallelize(Seq(a.head._1, a.head._2)) collection.saveToCassandra() Or something like that, but as you know, I can't do this within the foreachRDD but only at the driver level. How do I use the arr variable to do something like that? Thanks for any help, Harold
Re: Manipulating RDDs within a DStream
Thanks Lalit, and Helena, What I'd like to do is manipulate the values within a DStream like this: DStream.foreachRDD( rdd => { val arr = record.toArray }) I'd then like to be able to insert results from the arr back into Cassandra, after I've manipulated the arr array. However, for all the examples I've seen, inserting into Cassandra is something like: val collection = sc.parallelize(Seq("foo", "bar")) Where "foo" and "bar" could be elements in the arr array. So I would like to know how to insert into Cassandra at the worker level. Best wishes, Harold On Thu, Oct 30, 2014 at 11:48 PM, lalit1303 la...@sigmoidanalytics.com wrote: Hi, Since the cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation and then close the connection. For example: wordCounts.foreachRDD( rdd => { val arr = rdd.toArray OPEN cassandra connection store arr CLOSE cassandra connection }) Thanks - Lalit Yadav la...@sigmoidanalytics.com
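A concrete sketch of that pattern using the connector's own writer, so both the per-element manipulation and the write happen on the workers (the keyspace, table, and column names are hypothetical):

import com.datastax.spark.connector._ // adds saveToCassandra and SomeColumns

wordCounts.foreachRDD { rdd =>
  rdd
    .map { case (word, count) => (word.toLowerCase, count) } // any per-element manipulation runs on the workers
    .saveToCassandra("my_keyspace", "word_counts", SomeColumns("word", "count"))
}

The connector manages its Cassandra sessions per executor, so nothing needs to be opened or closed by hand inside the loop.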
Re: Accessing Cassandra with SparkSQL, Does not work?
Hi Shahab, I'm just curious, are you explicitly needing to use thrift? Just using the connector with spark does not require any thrift dependencies. Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" But to your question, you declare the keyspace but also unnecessarily repeat the keyspace.table in your select. Try this instead: val cc = new CassandraSQLContext(sc) cc.setKeyspace("keyspaceName") val result = cc.sql("SELECT * FROM tableName") etc - Helena @helenaedelson On Oct 31, 2014, at 1:25 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. [...]
Re: A Spark Design Problem
Hi Steve, Are you talking about sequence alignment? — FG On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com wrote: The original problem is in biology but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks, and a key that fits a lock will have a high score. There may be many keys fitting one lock, and a key may fit no locks well. The object is to find the best fitting lock for each key. Assume that the number of keys and locks is high enough that taking the cartesian product of the two is computationally impractical. Also assume that keys and locks have an attached location which is accurate within an error (say 1 km). Only keys and locks within 1 km need be compared. Now assume I can create a JavaRDD<Key> and a JavaRDD<Lock>. I could divide the locations into 1 km squared bins and look only within a few bins. Assume that it is practical to take a cartesian product for all elements in a bin but not to keep all elements in memory. I could map my RDDs into PairRDDs where the key is the bin assigned by location. I know how to take the cartesian product of two JavaRDDs but not how to take a cartesian product of sets of elements sharing a common key (bin). Any suggestions? Assume that in the worst cases the number of elements in a bin is too large to keep in memory, although if a bin were subdivided into, say, 100 subbins the elements would fit in memory. Any thoughts as to how to attack the problem?
Re: Manipulating RDDs within a DStream
Hi Harold, Yes, that is the problem :) Sorry for the confusion, I will make this clear in the docs ;) since master is work for the next version. All you need to do is use spark 1.1.0 as you have it already, "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1", and assembly - not from master; checkout branch b1.1 and sbt ;clean ;reload ;assembly Cheers, - Helena @helenaedelson On Oct 31, 2014, at 1:35 PM, Harold Nguyen har...@nexgate.com wrote: Hi Helena, Thanks very much! I'm using Spark 1.1.0, and spark-cassandra-connector-assembly-1.2.0-SNAPSHOT Best wishes, Harold [...]
Re: Manipulating RDDs within a DStream
Hi Harold, This is a great use case, and here is how you could do it, for example, with Spark Streaming: Using a Kafka stream: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L50 Save raw data to Cassandra from that stream: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L56 Do n-computations on that streaming data: reading from Kafka, computing in Spark, and writing to Cassandra: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L69-L71 I hope that helps, and if not I’ll dig up another. - Helena @helenaedelson On Oct 31, 2014, at 1:37 PM, Harold Nguyen har...@nexgate.com wrote: Thanks Lalit, and Helena [...]
Re: Accessing Cassandra with SparkSQL, Does not work?
Thanks Helena. I tried setting the KeySpace, but I got the same result. I also removed other Cassandra dependencies, but still the same exception! I also tried to see if this setting appears in the CassandraSQLContext or not, so I printed out the configuration: val cc = new CassandraSQLContext(sc) cc.setKeyspace("mydb") cc.conf.getAll.foreach(f => println(f._1 + " : " + f._2)) printout: spark.tachyonStore.folderName : spark-ec8ecb6a-1485-4d39-a93c-6f91711804a2 spark.driver.host : 192.168.1.111 spark.cassandra.connection.host : localhost spark.cassandra.input.split.size : 1 spark.app.name : SomethingElse spark.fileserver.uri : http://192.168.1.111:51463 spark.driver.port : 51461 spark.master : local Does it have anything to do with the version of Apache Cassandra that I use?? I use apache-cassandra-2.1.0 best, /Shahab The shortened SBT:

"com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" withSources() withJavadoc(),
"net.jpountz.lz4" % "lz4" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided" exclude("org.apache.hadoop", "hadoop-core"),
"org.apache.spark" %% "spark-streaming" % "1.1.0" % "provided",
"org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided",
"com.github.nscala-time" %% "nscala-time" % "1.0.0",
"org.scalatest" %% "scalatest" % "1.9.1" % "test",
"org.apache.spark" %% "spark-sql" % "1.1.0" % "provided",
"org.apache.spark" %% "spark-hive" % "1.1.0" % "provided",
"org.json4s" %% "json4s-jackson" % "3.2.5",
"junit" % "junit" % "4.8.1" % "test",
"org.slf4j" % "slf4j-api" % "1.7.7",
"org.slf4j" % "slf4j-simple" % "1.7.7",
"org.clapper" %% "grizzled-slf4j" % "1.0.2",
"log4j" % "log4j" % "1.2.17"

On Fri, Oct 31, 2014 at 6:42 PM, Helena Edelson helena.edel...@datastax.com wrote: Hi Shahab, I'm just curious, are you explicitly needing to use thrift? [...]
Re: SparkContext.stop() ?
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote: In cluster settings, if you don't explicitly call sc.stop() your application may hang. Like closing files, network connections, etc. when you're done with them, it's a good idea to call sc.stop(), which lets the spark master know that your application is finished consuming resources. On Fri, Oct 31, 2014 at 10:13 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: It is used to shut down the context when you're done with it, but if you're using a context for the lifetime of your application I don't think it matters. [...]
Re: Accessing Cassandra with SparkSQL, Does not work?
Hi Shahab, The apache cassandra version looks great. I think that doing cc.setKeyspace("mydb") cc.sql("SELECT * FROM mytable") versus cc.setKeyspace("mydb") cc.sql("select * from mydb.mytable") is the problem? And if not, would you mind creating a ticket off-list for us to help further? You can create one here: https://github.com/datastax/spark-cassandra-connector/issues with tag: help wanted :) Cheers, - Helena @helenaedelson On Oct 31, 2014, at 1:59 PM, shahab shahab.mok...@gmail.com wrote: Thanks Helena. I tried setting the KeySpace, but I got the same result. [...]
LinearRegression and model prediction threshold
Hi All, I am using LinearRegression and have a question about the details of the model.predict method. Basically it is predicting variable y given an input vector x. However, can someone point me to the documentation about what threshold is used in the predict method? Can that be changed? I am assuming that the i/p vector essentially gets mapped to a number and is compared against a threshold value, and then y is either set to 0 or 1 based on those two numbers. Another question I have is: if I want to save the model to hdfs for later reuse, is there a recommended way of doing that? // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) // Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map { point => val prediction = model.predict(point.features) (point.label, prediction) }
Re: Accessing Cassandra with SparkSQL, Does not work?
OK, I created an issue. Hopefully it will be resolved soon. Again thanks, best, /Shahab On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com wrote: Hi Shahab, The apache cassandra version looks great. [...]
Re: SparkContext.stop() ?
Actually, if you don't call SparkContext.stop(), the event log information that is used by the history server will be incomplete, and your application will never show up in the history server's UI. If you don't use that functionality, then you're probably ok not calling it, as long as your application exits after it's done using the context. On Fri, Oct 31, 2014 at 8:12 AM, ll duy.huynh@gmail.com wrote: what is it for? when do we call it? thanks! -- Marcelo
Re: LinearRegression and model prediction threshold
It sounds like you are asking about logistic regression, not linear regression. If so, yes, that's just what it does. The default would be 0.5 in logistic regression. If you 'clear' the threshold you get the raw margin out of this and other linear classifiers. On Fri, Oct 31, 2014 at 7:18 PM, Sameer Tilak ssti...@live.com wrote: Hi All, I am using LinearRegression and have a question about the details of the model.predict method. [...]
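A minimal sketch of the logistic-regression case (assuming parsedData is the RDD[LabeledPoint] from the original post):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

val model = LogisticRegressionWithSGD.train(parsedData, 100)
val labels = parsedData.map(p => model.predict(p.features)) // 0.0 or 1.0, using the default 0.5 threshold

model.clearThreshold() // from here on, predict returns the raw score instead of a 0/1 label
val scores = parsedData.map(p => model.predict(p.features))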
Re: A Spark Design Problem
Does the following help? JavaPairRDD<bin, key> join with JavaPairRDD<bin, lock>. If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 11:19 PM, francois.garil...@typesafe.com wrote: Hi Steve, Are you talking about sequence alignment? [...]
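A sketch of that idea in Scala (Key, Lock, binOf, and score are hypothetical stand-ins for the poster's types and functions):

import org.apache.spark.SparkContext._ // pair-RDD functions
import org.apache.spark.rdd.RDD

val keysByBin: RDD[(Long, Key)] = keys.map(k => (binOf(k.location), k))
val locksByBin: RDD[(Long, Lock)] = locks.map(l => (binOf(l.location), l))

val bestLockPerKey = keysByBin.join(locksByBin) // every key/lock pair that shares a bin
  .map { case (_, (k, l)) => (k.id, (l, score(k, l))) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b) // keep the highest-scoring lock per key

Keys near a bin boundary would also need to be compared against the neighbouring bins, for example by emitting each key under the adjacent bin ids as well before the join.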
Re: LinearRegression and model prediction threshold
You can serialize the model to a local/hdfs file system and use it later when you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote: It sounds like you are asking about logistic regression, not linear regression. [...]
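One way to do that in Spark 1.1, before dedicated model save/load APIs, is to write the model out as a one-element object file; a sketch, with a hypothetical HDFS path:

import org.apache.spark.mllib.regression.LinearRegressionModel

sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///models/lr-model")

// later, possibly in a different job:
val restored = sc.objectFile[LinearRegressionModel]("hdfs:///models/lr-model").first()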
Example of Fold
I want to fold an RDD into a smaller RDD with max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th. So I believe fold is the appropriate method for this. My question is: is there a good fold example out there? Additionally, what is the zero value used for as the first argument? Thanks.
Spark Build
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully previously and am trying to rebuild again to take advantage of the new Hive 0.13.1 profile. I execute the following command: $ mvn -DskipTests -Phive-0.13-1 -Phadoop-2.4 -Pyarn clean package The build fails at the following stage: INFO] Using incremental compilation [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [INFO] Compiling 5 Scala sources to /home/terrys/Applications/spark/yarn/stable/target/scala-2.10/test-classes... [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:20: object MemLimitLogger is not a member of package org.apache.spark.deploy.yarn [ERROR] import org.apache.spark.deploy.yarn.MemLimitLogger._ [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:29: not found: value memLimitExceededLogMessage [ERROR] val vmemMsg = memLimitExceededLogMessage(diagnostics, VMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] /home/terrys/Applications/spark/yarn/common/src/test/scala/org/apache/spark/deploy/yarn/YarnAllocatorSuite.scala:30: not found: value memLimitExceededLogMessage [ERROR] val pmemMsg = memLimitExceededLogMessage(diagnostics, PMEM_EXCEEDED_PATTERN) [ERROR] ^ [ERROR] three errors found [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.758s] [INFO] Spark Project Common Network Code . SUCCESS [6.716s] [INFO] Spark Project Core SUCCESS [2:46.610s] [INFO] Spark Project Bagel ... SUCCESS [16.776s] [INFO] Spark Project GraphX .. SUCCESS [52.159s] [INFO] Spark Project Streaming ... SUCCESS [1:09.883s] [INFO] Spark Project ML Library .. SUCCESS [1:18.932s] [INFO] Spark Project Tools ... SUCCESS [10.210s] [INFO] Spark Project Catalyst SUCCESS [1:12.499s] [INFO] Spark Project SQL . SUCCESS [1:10.561s] [INFO] Spark Project Hive SUCCESS [1:08.571s] [INFO] Spark Project REPL SUCCESS [32.377s] [INFO] Spark Project YARN Parent POM . SUCCESS [1.317s] [INFO] Spark Project YARN Stable API . FAILURE [25.918s] [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter SKIPPED [INFO] Spark Project External Kafka .. SKIPPED [INFO] Spark Project External Flume Sink . SKIPPED [INFO] Spark Project External Flume .. SKIPPED [INFO] Spark Project External ZeroMQ . SKIPPED [INFO] Spark Project External MQTT ... SKIPPED [INFO] Spark Project Examples SKIPPED [INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 11:15.889s [INFO] Finished at: Fri Oct 31 12:08:55 PDT 2014 [INFO] Final Memory: 73M/829M [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-yarn_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :spark-yarn_2.10 I could not find MemLimitLogger anywhere in the Spark code. 
Anybody else seen/encountered this? Thanks, -Terry
Re: Spark Build
Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon. On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote: I am synced up to the Spark master branch as of commit 23468e7e96. [...]
Re: Spark Build
Thanks for the update, Shivaram. -Terry On 10/31/14, 12:37 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon. [...]
Spark Standalone on cluster stops
Hi, I have an issue with running Spark in standalone mode on a cluster. Everything seems to run fine for a couple of minutes until Spark stops executing the tasks. Any idea? Would appreciate some help. Thanks in advance, Tassilo I get errors like this at the end: 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/7 is now EXITED (Command exited with code 1) 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141031161538-0003/7 removed: Command exited with code 1 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor added: app-20141031161538-0003/11 on worker-20141031142207-localhost-36911 (localhost:36911) with 12 cores 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20141031161538-0003/11 on hostPort localhost:36911 with 12 cores, 16.0 GB RAM 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/11 is now RUNNING 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor updated: app-20141031161538-0003/8 is now EXITED (Command exited with code 1) 14/10/31 16:16:59 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141031161538-0003/8 removed: Command exited with code 1 14/10/31 16:16:59 INFO client.AppClient$ClientActor: Executor added: app-20141031161538-0003/12 on worker-20141031142207-localhost-41750 (localhost:41750) with 12 cores
Re: Example of Fold
You should look at how fold is used in Scala in general to help. Here is a blog post that may also give some guidance: http://blog.madhukaraphatak.com/spark-rdd-fold The zero value should be your bean, with the 4th parameter set to the minimum value. Your fold function should compare the 4th param in the incoming records and choose the larger one. If you want to group by the 3 parameters prior to folding, you're probably better off using a reduce function. On Fri, Oct 31, 2014 at 12:01 PM, Ron Ayoub ronalday...@live.com wrote: I want to fold an RDD into a smaller RDD with max elements. [...]
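A sketch of the reduceByKey approach (Bean and its fields are hypothetical stand-ins for the poster's class; beans is an RDD[Bean]):

import org.apache.spark.SparkContext._ // pair-RDD functions

case class Bean(a: String, b: String, c: String, value: Double)

val maxPerGroup = beans
  .keyBy(b => (b.a, b.b, b.c)) // group by the first three properties
  .reduceByKey((x, y) => if (x.value >= y.value) x else y) // keep the record with the larger 4th property
  .values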
Re: Help with error initializing SparkR.
Try running this command: sudo -E R CMD javareconf and then start Spark. Basically, it syncs R's Java configuration with your system's Java configuration. Good luck!
hadoop_conf_dir when running spark on yarn
How do I set up hadoop_conf_dir correctly when I'm running my spark job on Yarn? My Yarn environment has the correct hadoop_conf_dir settings, but the configuration that I pull from sc.hadoopConfiguration() is incorrect.
SparkSQL performance
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required. This essentially means in most cases SparkSQL should be as fast as Spark is. I would be very interested to hear what others in the group have to say about this. Thanks -Soumya
Re: SparkSQL performance
We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product in focus is tuned for ideal performance while the competitors are not. The authors are not necessarily biased, but as a consequence the results are. Ideally it’s critical for the user community to be informed of all the in-depth tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-) Du From: Soumya Simanta soumya.sima...@gmail.com Date: Friday, October 31, 2014 at 4:04 PM To: user@spark.apache.org Subject: SparkSQL performance I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style [...]
Spark Meetup in Singapore
Dear Sir/Madam, I would like to become an organiser of a Singapore meetup to promote the regional Spark and big data community in the ASEAN area. My name is Songtao; I am a big data consultant in Singapore and have a great passion for Spark technologies. Thanks, Songtao
Re: SparkContext UI
Hi Sean/Sameer, It seems you're both right. In the python shell I need to explicitly call the empty parens data.cache(), then run an action and it appears in the storage tab. Using the scala shell I can just call data.cache without the parens, run an action, and that works. Thanks for your help. Stu On 31 October 2014 19:19, Sean Owen so...@cloudera.com wrote: No, empty parens do not matter when calling no-arg methods in Scala. This invocation should work as-is and should result in the RDD showing in Storage. I see that when I run it right now. Since it really does/should work, I'd look at other possibilities -- is it maybe taking a short time to start caching? looking at a different/old Storage tab? On Fri, Oct 31, 2014 at 1:17 AM, Sameer Farooqui same...@databricks.com wrote: Hi Stuart, You're close! Just add a () after the cache, like: data.cache() ...and then run the .count() action on it and you should be good to see it in the Storage UI! - Sameer On Thu, Oct 30, 2014 at 4:50 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Sorry, too quick to pull the trigger on my original email. I should have added that I've tried using persist() and cache() but no joy. I'm doing this: data = sc.textFile("somedata") data.cache data.count() but I still can't see anything in the storage? On 31 October 2014 10:42, Sameer Farooqui same...@databricks.com wrote: Hey Stuart, The RDD won't show up under the Storage tab in the UI until it's been cached. Basically Spark doesn't know what the RDD will look like until it's cached, b/c up until then the RDD is just on disk (external to Spark). If you launch some transformations + an action on an RDD that is purely on disk, then Spark will read it from disk, compute against it and then write the results back to disk or show you the results at the scala/python shells. But when you run Spark workloads against purely on-disk files, the RDD won't show up in Spark's Storage UI. Hope that makes sense... - Sameer On Thu, Oct 30, 2014 at 4:30 PM, Stuart Horsman stuart.hors...@gmail.com wrote: Hi All, When I load an RDD with: data = sc.textFile("somefile") I don't see the resulting RDD in the SparkContext gui on localhost:4040 in /storage. Is there something special I need to do to allow me to view this? I tried both scala and python shells but same result. Thanks Stuart
Re: SparkSQL performance
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (order of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone done any comparison of SparkSQL with Impala? The fact that many of the queries don't even finish in the benchmark is quite surprising and hard to believe. A few months ago there were a few emails about Spark not being able to handle large volumes (TBs) of data. That myth was busted recently when the folks at Databricks published their sorting record results. Thanks -Soumya On Fri, Oct 31, 2014 at 7:35 PM, Du Li l...@yahoo-inc.com wrote: We have seen all kinds of results published that often contradict each other. [...]