SparkSQL hang due to PERFLOG method=acquireReadWriteLocks

2014-09-12 Thread linkpatrickliu
I am running Spark Standalone mode with Spark 1.1

I started SparkSQL thrift server as follows:
./sbin/start-thriftserver.sh

Then I use beeline to connect to it.
Now I can CREATE, SELECT, and SHOW databases and tables,
but when I run DROP or LOAD DATA INPATH 'kv1.txt' INTO TABLE src, the
Beeline client hangs.

Here is the log of thriftServer:

14/09/12 13:59:41 INFO Driver: /PERFLOG method=doAuthorization
start=1410501581524 end=1410501581549 duration=25
14/09/12 13:59:41 INFO Driver: /PERFLOG method=compile start=1410501581500
end=1410501581549 duration=49
14/09/12 13:59:41 INFO Driver: PERFLOG method=acquireReadWriteLocks

Can anyone help with this? Many thanks!






spark-1.1.0 with make-distribution.sh problem

2014-09-12 Thread Zhanfeng Huo
Hi,

I compiled Spark with the command

bash -x make-distribution.sh -Pyarn -Phive --skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0

but it fails. How do I use it correctly?

   Error message:
+ set -o pipefail 
+ set -e 
+++ dirname make-distribution.sh 
++ cd . 
++ pwd 
+ FWDIR=/home/syn/spark/spark-1.1.0 
+ DISTDIR=/home/syn/spark/spark-1.1.0/dist 
+ SPARK_TACHYON=false 
+ MAKE_TGZ=false 
+ NAME=none 
+ (( 7 )) 
+ case $1 in 
+ break 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ which git 
++ git rev-parse --short HEAD 
+ GITREV=5f6f219 
+ '[' '!' -z 5f6f219 ']' 
+ GITREVSTRING=' (git revision 5f6f219)' 
+ unset GITREV 
+ which mvn 
++ mvn help:evaluate -Dexpression=project.version 
++ grep -v INFO 
++ tail -n 1 
+ VERSION=1.1.0 
++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test 
--with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0 
++ grep -v INFO 
++ tail -n 1 
+ SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

Best Regards


Zhanfeng Huo


Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Akhil Das
Hi Jim,

This approach will not work right out of the box; a few things need to be in
place. The driver program and the master communicate with each other, so you
need to open up certain ports on your public IP (read about port forwarding at
http://portforward.com/). Also, on the cluster you need to set
*spark.driver.host* and *spark.driver.port* (by default this is random),
pointing to your public IP and the port that you opened up.


Thanks
Best Regards
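
A minimal sketch of the configuration described above, assuming placeholder
values for the master address, public IP, and port (none of these come from
this thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("remote-driver-example")
  .setMaster("spark://ec2-master-public-dns:7077")  // your EC2 master (placeholder)
  .set("spark.driver.host", "203.0.113.10")         // your public IP (placeholder)
  .set("spark.driver.port", "51000")                // a port you opened/forwarded (placeholder)
val sc = new SparkContext(conf)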

On Thu, Sep 11, 2014 at 11:52 PM, Jim Carroll jimfcarr...@gmail.com wrote:

 Hello all,

 I'm trying to run a Driver on my local network with a deployment on EC2 and
 it's not working. I was wondering if either the master or slave instances
 (in standalone) connect back to the driver program.

 I outlined the details of my observations in a previous post but here is
 what I'm seeing:

 I have v1.1.0 installed (the new tag) on ec2 using the spark-ec2 script.
 I have the same version of the code built locally.
 I edited the master security group to allow inbound access from anywhere to
 7077 and 8080.
 I see a connection take place.
 I see the workers fail with a timeout when any job is run.
 The master eventually removes the driver's job.

 I suppose this makes sense if there's a requirement for either the worker
 or the master to be on the same network as the driver. Is that the case?

 Thanks
 Jim








Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Davies Liu
This is a bug; I have created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500

Also, there is PR to fix this: https://github.com/apache/spark/pull/2369

Until the next bugfix release, you can work around this with:

srdd = sqlCtx.jsonRDD(rdd)
srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
 Hi All,

 I'm having some trouble with the coalesce and repartition functions for
 SchemaRDD objects in pyspark.  When I run:

 sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}',
 '{"foo":"baz"}'])).coalesce(1)

 I get this error:

 Py4JError: An error occurred while calling o94.coalesce. Trace:
 py4j.Py4JException: Method coalesce([class java.lang.Integer, class
 java.lang.Boolean]) does not exist

 For context, I have a dataset stored in a parquet file, and I'm using
 SQLContext to make several queries against the data.  I then register the
 results of these queries as new tables in the SQLContext.  Unfortunately
 each new table has the same number of partitions as the original (despite
 being much smaller).  Hence my interest in coalesce and repartition.

 Has anybody else encountered this bug?  Is there an alternate workflow I
 should consider?

 I am running the 1.1.0 binaries released today.

 best,
 -Brad




Yarn Over-allocating Containers

2014-09-12 Thread praveen seluka
Hi all

I am seeing a strange issue in Spark on YARN (stable). Let me know if this is
known, or whether I am missing something, as it looks very fundamental.

Launch a Spark job with 2 containers: addContainerRequest is called twice and
then allocate is called on the AMRMClient. This gets 2 containers allocated.
Fine so far.

The reporter thread then starts. Now, if 1 of the containers dies, this is what
happens: the reporter thread adds another addContainerRequest, and the next
allocate *actually* gets back 3 containers (the total number of container
requests from the beginning). The reporter thread has a check to discard
(release) excess containers and ends up releasing 2.

In summary, the job starts with 2 containers, 1 dies (let's say), the reporter
thread adds 1 more container request, subsequently gets back 3 allocated
containers (from YARN), and discards 2 as it needed just 1.

Thanks
Praveen
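
For illustration only, a hand-written sketch of the AMRMClient cycle being
described (this is not Spark's actual YARN allocator code): YARN keeps a
container request outstanding until the application master removes it, so a
request that is never removed after being satisfied can be satisfied again on
a later allocate() call. Sizes and priorities below are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import scala.collection.JavaConverters._

val amClient = AMRMClient.createAMRMClient[ContainerRequest]()
amClient.init(new Configuration())
amClient.start()

// Ask for one executor-sized container (the sizes are placeholders).
val request = new ContainerRequest(Resource.newInstance(1024, 1), null, null, Priority.newInstance(1))
amClient.addContainerRequest(request)

// Later, in the reporter/allocation loop:
val response = amClient.allocate(0.1f)
response.getAllocatedContainers.asScala.foreach { container =>
  // Without this, the request stays outstanding on the YARN side and can be
  // granted again by a subsequent allocate(), which is the over-allocation
  // behaviour described above.
  amClient.removeContainerRequest(request)
  // ... launch an executor in `container` ...
}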


Re: single worker vs multiple workers on each machine

2014-09-12 Thread Mayur Rustagi
Another aspect to keep in mind is that a JVM with a heap above 8-10 GB starts
to misbehave. It is typically better to split it up into chunks of roughly
15 GB. If you are choosing machines, about 10 GB per core is a reasonable
ratio to maintain.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi


On Fri, Sep 12, 2014 at 2:59 AM, Sean Owen so...@cloudera.com wrote:

 As I understand, there's generally not an advantage to running many
 executors per machine. Each will already use all the cores, and
 multiple executors just means splitting the available memory instead
 of having one big pool. I think there may be an argument at extremes
 of scale where one JVM with a huge heap might have excessive GC
 pauses, or too many open files, that kind of thing?

 On Thu, Sep 11, 2014 at 8:42 PM, Mike Sam mikesam...@gmail.com wrote:
  Hi There,
 
  I am new to Spark and I was wondering when you have so much memory on
 each
  machine of the cluster, is it better to run multiple workers with limited
  memory on each machine or is it better to run a single worker with
 access to
  the majority of the machine memory? If the answer is it depends, would
 you
  please elaborate?
 
  Thanks,
  Mike





Re: Computing mean and standard deviation by key

2014-09-12 Thread rzykov
Is it possible to use  DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
  
for calculating mean and std dev for Paired RDDs (key, value)?

Right now I'm using an approach with reduceByKey, but I want to make my code
more concise and readable.








Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
I generally call values.stats, e.g.:

val stats = myPairRdd.values.stats

On Fri, Sep 12, 2014 at 4:46 PM, rzykov rzy...@gmail.com wrote:

 Is it possible to use  DoubleRDDFunctions
 
 https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
 
 for calculating mean and std dev for Paired RDDs (key, value)?

 Now I'm using an approach with ReduceByKey but want to make my code more
 concise and readable.









Re: spark-1.1.0 with make-distribution.sh problem

2014-09-12 Thread Zhanfeng Huo
resolved:

./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn 
-Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests

The make-distribution.sh script is a bit misleading.




Zhanfeng Huo
 
From: Zhanfeng Huo
Date: 2014-09-12 14:13
To: user
Subject: spark-1.1.0 with make-distribution.sh problem
Hi,

I compile spark with cmd  bash -x make-distribution.sh -Pyarn -Phive 
--skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0 
-Phadoop.version=2.3.0, it errors.
   
How to use it correct?

   message:
+ set -o pipefail 
+ set -e 
+++ dirname make-distribution.sh 
++ cd . 
++ pwd 
+ FWDIR=/home/syn/spark/spark-1.1.0 
+ DISTDIR=/home/syn/spark/spark-1.1.0/dist 
+ SPARK_TACHYON=false 
+ MAKE_TGZ=false 
+ NAME=none 
+ (( 7 )) 
+ case $1 in 
+ break 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ '[' -z /home/syn/usr/jdk1.7.0_55 ']' 
+ which git 
++ git rev-parse --short HEAD 
+ GITREV=5f6f219 
+ '[' '!' -z 5f6f219 ']' 
+ GITREVSTRING=' (git revision 5f6f219)' 
+ unset GITREV 
+ which mvn 
++ mvn help:evaluate -Dexpression=project.version 
++ grep -v INFO 
++ tail -n 1 
+ VERSION=1.1.0 
++ mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phive --skip-java-test 
--with-tachyon --tgz -Pyarn.version=2.3.0 -Phadoop.version=2.3.0 
++ grep -v INFO 
++ tail -n 1 
+ SPARK_HADOOP_VERSION=' -X,--debug Produce execution debug output'

Best Regards


Zhanfeng Huo


Re: Computing mean and standard deviation by key

2014-09-12 Thread Sean Owen
These functions operate on an RDD of Double which is not what you have, so
no this is not a way to use DoubleRDDFunctions. See earlier in the thread
for canonical solutions.
On Sep 12, 2014 8:06 AM, rzykov rzy...@gmail.com wrote:

 Tried this:

 ordersRDD.join(ordersRDD).map { case ((partnerid, itemid), ((matchedida,
 pricea), (matchedidb, priceb))) => ((matchedida, matchedidb), (if (priceb >
 0) (pricea / priceb).toDouble else 0.toDouble)) }
 .groupByKey
 .values.stats
 .first

 Error:
 <console>:37: error: could not find implicit value for parameter num:
 Numeric[Iterable[Double]]
   .values.stats









Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL):

SELECT order, mean(price) FROM orders GROUP BY order

In this case, I'm not aware of a way to use the DoubleRDDFunctions, since
you have a single RDD of pairs where each pair is of type (KeyType,
Iterable[Double]).

It seems to me that you want to write a function:

def stats(numList: Iterable[Double]): org.apache.spark.util.StatCounter

and then use

pairRdd.mapValues( value => stats(value) )
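
A minimal sketch of that per-key approach using Spark's StatCounter, assuming a
SparkContext sc as in the shell; the pair RDD below is made up for
illustration, and aggregateByKey is, I believe, available from Spark 1.1:

import org.apache.spark.util.StatCounter

val pairRdd = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))  // illustrative data

// groupByKey + a stats function over each Iterable[Double], as suggested above
val statsByKey = pairRdd.groupByKey().mapValues(values => StatCounter(values))

// or, without materialising the full value list per key:
val statsByKey2 = pairRdd.aggregateByKey(new StatCounter())(
  (acc, v) => acc.merge(v),         // fold one value into the running stats
  (acc1, acc2) => acc1.merge(acc2)) // combine partial stats across partitions

statsByKey2.mapValues(s => (s.mean, s.stdev)).collect().foreach(println)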




On Fri, Sep 12, 2014 at 5:05 PM, rzykov rzy...@gmail.com wrote:

 Tried this:

 ordersRDD.join(ordersRDD).map { case ((partnerid, itemid), ((matchedida,
 pricea), (matchedidb, priceb))) => ((matchedida, matchedidb), (if (priceb >
 0) (pricea / priceb).toDouble else 0.toDouble)) }
 .groupByKey
 .values.stats
 .first

 Error:
 <console>:37: error: could not find implicit value for parameter num:
 Numeric[Iterable[Double]]
   .values.stats









Re: RDD memory questions

2014-09-12 Thread Boxian Dong
Thank you very much for your help :)






Preserving conf files when restarting ec2 cluster

2014-09-12 Thread jerryye
Hi,
I'm using --use-existing-master to launch a previously stopped EC2 cluster
with spark-ec2. However, my configuration files are overwritten once the
cluster is set up. What's the best way of preserving existing configuration
files in spark/conf?

Alternatively, what I'm trying to do is set SPARK_WORKER_CORES to use fewer
cores than default. Is there a nice way to pass this while starting the
cluster or is it possible to do this in SparkContext?

I'm currently copying the configuration and restarting the cluster using the
stop-all.sh and start-all.sh scripts. Anything better would be greatly
appreciated.

Thanks!

- jerry






Re: Kryo deserialisation error

2014-09-12 Thread ayandas84
Hi,

I am also facing the same problem. Has anyone found a solution yet?

It just returns a vague set of characters.

Please help.


Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Exception while deserializing and fetching task:
com.esotericsoftware.kryo.KryoException: Unable to find class: 

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv;

 $(*,.02468:@BDFHJLNPRTVXZ^`bdfhlnprtvD^bjlnpv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv:

 $(*,.02468:@BDFHJNPRTVXZ\`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv8@p=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtv=

 $(*,.02468:@BDFHJLNPRTVXZ\^`bdfhjlnprtvxz|~
Serialization trace:







Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread yh18190
Hi all,

I am currently working with PySpark for NLP processing, using the TextBlob
Python library. In standalone mode it is easy to install external Python
libraries, but in cluster mode I am facing problems installing these libraries
on the worker nodes remotely. I cannot access each and every worker machine to
install these libs in the Python path. I tried using the SparkContext pyFiles
option to ship .zip files, but the problem is that these Python packages need
to be installed on the worker machines. Could anyone let me know the different
ways of doing this, so that this library (TextBlob) is available on the Python
path?






Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
Dear all,

I am sorry. This was a false alarm.

There was an issue in my RDD processing logic which led to a large backlog.
Once I fixed the issues in my processing logic, I could see all messages being
pulled nicely without any Block Removed errors. I need to tune certain
configurations in my Kafka consumer to adjust the data rate and also the
batch size.

Sorry again.


Regards,
Dibyendu

On Thu, Sep 11, 2014 at 8:13 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  This is my case about broadcast variable:

 14/07/21 19:49:13 INFO Executor: Running task ID 4
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost 
 (progress: 3/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO Executor: Finished task ID 3
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 5
 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3)*14/07/21 
 19:49:13 INFO ContextCleaner: Cleaned broadcast 0*
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on localhost 
 (progress: 4/106)
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO BlockManager: Removing block broadcast_0*14/07/21 
 19:49:13 INFO MemoryStore: Block broadcast_0 of size 202564 dropped from 
 memory (free 886623436)*
 14/07/21 19:49:13 INFO ContextCleaner: Cleaned shuffle 0
 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all files for shuffle 0
 14/07/21 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21 
 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21 
 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 4 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 4 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 4
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:6 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 6
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 4)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 4 in 80 ms on localhost 
 (progress: 5/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 5 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 5 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 5
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:7 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 7
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 5)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 5 in 77 ms on localhost 
 (progress: 6/106)
 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/21 19:49:13 ERROR Executor: Exception in task ID 6
 java.io.FileNotFoundException: http://172.31.34.174:52070/broadcast_0
   at 
 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
   at 
 org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:196)
   at 
 org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89)
   at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   

Serving data

2014-09-12 Thread Marius Soutier
Hi there,

I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote 
Scalding jobs - one-off, read data from HDFS, count words, write counts back to 
HDFS.

Now I want to display these counts in a dashboard. Since Spark allows to cache 
RDDs in-memory and you have to explicitly terminate your app (and there’s even 
a new JDBC server in 1.1), I’m assuming it’s possible to keep an app running 
indefinitely and query an in-memory RDD from the outside (via SparkSQL for 
example).

Is this how others are using Spark? Or are you just dumping job results into 
message queues or databases?


Thanks
- Marius
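
For what it's worth, a minimal sketch of that long-running pattern with Spark
SQL in 1.1; the paths, table name, and case class are made up for
illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class WordCount(word: String, total: Long)

val sc = new SparkContext(new SparkConf().setAppName("dashboard-backend"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

// One-off batch computation, then kept in memory for the lifetime of the app.
val counts = sc.textFile("hdfs:///data/input")        // placeholder path
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1L))
  .reduceByKey(_ + _)
  .map { case (w, c) => WordCount(w, c) }

counts.registerTempTable("word_counts")
sqlContext.cacheTable("word_counts")

// While the application stays alive, it can keep answering queries like:
val top = sqlContext.sql(
  "SELECT word, total FROM word_counts ORDER BY total DESC LIMIT 10").collect()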





Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Jeoffrey Lim
Our issue could be related to the problem described in
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html
in which the DStream is processed with a 1 hour batch duration.

I have implemented IO throttling in the Receiver as well in our Kafka
consumer, and our backlog is not that large.

INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping
INFO : org.apache.spark.storage.BlockManager - Dropping block
*input-0-1410443074600* from memory
INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of
size 12651900 dropped from memory (free 21220667)
INFO : org.apache.spark.storage.BlockManagerInfo - Removed
input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 in memory
(size: 12.1 MB, free: 100.6 MB)

The question that I have now is: how do we prevent the
MemoryStore/BlockManager from dropping the input blocks? And should these
drops be logged at the WARN/ERROR level?


Thanks.


On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark
User List] ml-node+s1001560n14075...@n3.nabble.com wrote:

 Dear all,

 I am sorry. This was a false alarm

 There was some issue in the RDD processing logic which leads to large
 backlog. Once I fixed the issues in my processing logic, I can see all
 messages being pulled nicely without any Block Removed error. I need to
 tune certain configurations in my Kafka Consumer to modify the data rate
 and also the batch size.

 Sorry again.


 Regards,
 Dibyendu

 On Thu, Sep 11, 2014 at 8:13 PM, Nan Zhu [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14075i=0 wrote:

  This is my case about broadcast variable:

 14/07/21 19:49:13 INFO Executor: Running task ID 4
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost 
 (progress: 3/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO Executor: Finished task ID 3
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 5
 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3)*14/07/21 
 19:49:13 INFO ContextCleaner: Cleaned broadcast 0*
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on localhost 
 (progress: 4/106)
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO BlockManager: Removing block broadcast_0*14/07/21 
 19:49:13 INFO MemoryStore: Block broadcast_0 of size 202564 dropped from 
 memory (free 886623436)*
 14/07/21 19:49:13 INFO ContextCleaner: Cleaned shuffle 0
 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all files for shuffle 0
 14/07/21 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21 
 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21 
 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 4 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 4 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 4
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:6 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 6
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 4)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 4 in 80 ms on localhost 
 (progress: 5/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 5 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 5 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 5
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:7 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 7
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 5)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 5 in 77 ms on localhost 
 (progress: 6/106)
 14/07/21 

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Dibyendu Bhattacharya
I agree.

Even the low-level Kafka consumer which I have written has tunable IO
throttling, which helped me solve this issue... But the question remains: even
if there is a large backlog, why does Spark drop unprocessed in-memory blocks?

Dib
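
One thing worth checking in this situation (a sketch, not a statement about the
custom consumer used in this thread): the storage level of the receiver's input
stream. With a MEMORY_AND_DISK level, blocks pushed out of memory can spill to
disk instead of being dropped outright. Using the standard KafkaUtils receiver
as an example, with placeholder connection details and assuming an existing
SparkContext sc:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))

val stream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",                      // ZooKeeper quorum (placeholder)
  "my-consumer-group",                 // consumer group id (placeholder)
  Map("my-topic" -> 1),                // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER_2)  // spill to disk rather than drop blocks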

On Fri, Sep 12, 2014 at 5:47 PM, Jeoffrey Lim jeoffr...@gmail.com wrote:

 Our issue could be related to this problem as described in:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html
  which
 the DStream is processed for every 1 hour batch duration.

 I have implemented IO throttling in the Receiver as well in our Kafka
 consumer, and our backlog is not that large.

 NFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping
 INFO : org.apache.spark.storage.BlockManager - Dropping block
 *input-0-1410443074600* from memory
 INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of
 size 12651900 dropped from memory (free 21220667)
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed
 input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 in memory
 (size: 12.1 MB, free: 100.6 MB)

 The question that I have now is: how to prevent the
 MemoryStore/BlockManager of dropping the block inputs? And should they be
 logged in the level WARN/ERROR?


 Thanks.


 On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark
 User List] [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14081i=0 wrote:

 Dear all,

 I am sorry. This was a false alarm

 There was some issue in the RDD processing logic which leads to large
 backlog. Once I fixed the issues in my processing logic, I can see all
 messages being pulled nicely without any Block Removed error. I need to
 tune certain configurations in my Kafka Consumer to modify the data rate
 and also the batch size.

 Sorry again.


 Regards,
 Dibyendu

 On Thu, Sep 11, 2014 at 8:13 PM, Nan Zhu [hidden email]
 http://user/SendEmail.jtp?type=nodenode=14075i=0 wrote:

  This is my case about broadcast variable:

 14/07/21 19:49:13 INFO Executor: Running task ID 4
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost 
 (progress: 3/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO Executor: Finished task ID 3
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 5
 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3)*14/07/21 
 19:49:13 INFO ContextCleaner: Cleaned broadcast 0*
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on localhost 
 (progress: 4/106)
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO BlockManager: Removing block broadcast_0*14/07/21 
 19:49:13 INFO MemoryStore: Block broadcast_0 of size 202564 dropped from 
 memory (free 886623436)*
 14/07/21 19:49:13 INFO ContextCleaner: Cleaned shuffle 0
 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all files for shuffle 0
 14/07/21 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+514/07/21 
 19:49:13 INFO HadoopRDD: Input split: 
 hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5
 14/07/21 
 http://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+514/07/21 
 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 4 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 4 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 4
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on 
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:6 as 11885 bytes 
 in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 6
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 4)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 4 in 80 ms on localhost 
 (progress: 5/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for 
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 5 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 5 directly to driver
 14/07/21 19:49:13 INFO Executor: Finished task ID 5
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on 
 

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-12 Thread Aniket Bhatnagar
Hi Reinis

Try whether the exclude suggestions from me and Sean work for you. If not, can
you turn on verbose class loading to see where
javax.servlet.ServletRegistration is loaded from? The class should load from
"org.mortbay.jetty" % "servlet-api" % jettyVersion. If it loads from some other
jar, you will have to exclude that jar from your build.

Hope it helps.

Thanks,
Aniket

On 12 September 2014 02:21, sp...@orbit-x.de wrote:

 Thank you, Aniket for your hint!

 Alas, I am facing really hellish situation as it seems, because I have
 integration tests using BOTH spark and HBase (Minicluster). Thus I get
 either:

 class javax.servlet.ServletRegistration's signer information does not
 match signer information of other classes in the same package
 java.lang.SecurityException: class javax.servlet.ServletRegistration's
 signer information does not match signer information of other classes in
 the same package
 at java.lang.ClassLoader.checkCerts(ClassLoader.java:943)
 at java.lang.ClassLoader.preDefineClass(ClassLoader.java:657)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:785)

 or:

 [info]   Cause: java.lang.ClassNotFoundException:
 org.mortbay.jetty.servlet.Context
 [info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 [info]   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 [info]   at java.security.AccessController.doPrivileged(Native Method)
 [info]   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
 [info]   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 [info]   at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
 [info]   at
 org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:661)
 [info]   at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:552)
 [info]   at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:720)

 I am searching the web already for a week trying to figure out how to make
 this work :-/

 all the help or hints are greatly appreciated
 reinis


 --
 -----Original Message-----
 From: Aniket Bhatnagar aniket.bhatna...@gmail.com
 To: sp...@orbit-x.de
 Cc: user user@spark.apache.org
 Date: 11-09-2014 20:00
 Subject: Re: Re[2]: HBase 0.96+ with Spark 1.0+


 Dependency hell... my favourite problem :).

 I had run into a similar issue with HBase and Jetty. I can't remember the
 exact fix, but here are excerpts from my dependencies that may be relevant:

 val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
   ExclusionRule(organization = "javax.servlet"),
   ExclusionRule(organization = "javax.servlet.jsp"),
   ExclusionRule(organization = "org.mortbay.jetty")
 )

 val hadoop2MapRedClient = "org.apache.hadoop" % "hadoop-mapreduce-client-core" % hadoop2Version

 val hbase = "org.apache.hbase" % "hbase" % hbaseVersion excludeAll(
   ExclusionRule(organization = "org.apache.maven.wagon"),
   ExclusionRule(organization = "org.jboss.netty"),
   ExclusionRule(organization = "org.mortbay.jetty"),
   ExclusionRule(organization = "org.jruby") // Don't need HBase's jruby. It pulls in a whole lot of other dependencies like joda-time.
 )

 val sparkCore = "org.apache.spark" %% "spark-core" % sparkVersion

 val sparkStreaming = "org.apache.spark" %% "spark-streaming" % sparkVersion

 val sparkSQL = "org.apache.spark" %% "spark-sql" % sparkVersion

 val sparkHive = "org.apache.spark" %% "spark-hive" % sparkVersion

 val sparkRepl = "org.apache.spark" %% "spark-repl" % sparkVersion

 val sparkAll = Seq(
   sparkCore excludeAll(
     ExclusionRule(organization = "org.apache.hadoop")), // We assume hadoop 2 and hence omit hadoop 1 dependencies
   sparkSQL,
   sparkStreaming,
   hadoop2MapRedClient,
   hadoop2Common,
   "org.mortbay.jetty" % "servlet-api" % "3.0.20100224"
 )

 On Sep 11, 2014 8:05 PM, sp...@orbit-x.de wrote:

 Hi guys,

 any luck with this issue, anyone?

 I have also tried all the possible exclusion combos, to no avail.

 thanks for your ideas
 reinis

 -----Original Message-----
  From: Stephen Boesch java...@gmail.com
  To: user user@spark.apache.org
  Date: 28-06-2014 15:12
  Subject: Re: HBase 0.96+ with Spark 1.0+
 
  Hi Siyuan,
 Thanks for the input. We are preferring to use the SparkBuild.scala
 instead of maven. I did not see any protobuf.version related settings in
 that file. But - as noted by Sean Owen - in any case the issue we are
 facing presently is about the duplicate incompatible javax.servlet entries
 - apparently from the org.mortbay artifacts.


 
  2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com:
  Hi Stephen,
 
 I am using spark1.0+ HBase0.96.2. This is what I did:
 1) rebuild spark using: mvn -Dhadoop.version=2.3.0
 -Dprotobuf.version=2.5.0 -DskipTests clean package
 2) In spark-env.sh, set SPARK_CLASSPATH =
 /path-to/hbase-protocol-0.96.2-hadoop2.jar

 
 Hopefully it can help.
 Siyuan


 
  On Sat, Jun 28, 2014 at 8:52 

Re: Out of memory with Spark Streaming

2014-09-12 Thread Aniket Bhatnagar
Hi all

Sorry, but this was totally my mistake. In my persistence logic, I was
creating an async HTTP client instance in RDD foreach but never closing it,
leading to memory leaks.

Apologies for wasting everyone's time.

Thanks,
Aniket
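
For anyone who hits something similar, a minimal sketch of the usual fix:
create heavyweight clients once per partition and close them in a finally
block. The DummyHttpClient below is a stand-in for whatever client library is
in use, and sc is assumed to be an existing SparkContext.

// Stand-in for any connection-holding resource that must be closed.
class DummyHttpClient {
  def post(record: String): Unit = ()   // pretend to send the record somewhere
  def close(): Unit = ()                // release sockets/threads
}

val records = sc.parallelize(Seq("a", "b", "c"))  // stand-in for the batch's RDD

records.foreachPartition { iter =>
  val client = new DummyHttpClient()    // one client per partition, not per record
  try {
    iter.foreach(r => client.post(r))
  } finally {
    client.close()                      // always closed, even if posting fails
  }
}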

On 12 September 2014 02:20, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Which version of spark are you running?

 If you are running the latest one, then could try running not a window but
 a simple event count on every 2 second batch, and see if you are still
 running out of memory?

 TD


 On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I did change it to be 1 gb. It still ran out of memory but a little later.

 The streaming job isnt handling a lot of data. In every 2 seconds, it
 doesn't get more than 50 records. Each record size is not more than 500
 bytes.
  On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com
 wrote:

 You could set spark.executor.memory to something bigger than the
 default (512mb)


 On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am running a simple Spark Streaming program that pulls in data from
 Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
 data and persists to a store.

 The program is running in local mode right now and runs out of memory
 after a while. I am yet to investigate heap dumps but I think Spark isn't
 releasing memory after processing is complete. I have even tried changing
 storage level to disk only.

 Help!

 Thanks,
 Aniket






Error Driver disassociated while running the spark job

2014-09-12 Thread 남윤민
I got this error from the executor's stderr:

[akka.tcp://sparkDriver@saturn00:49464] disassociated! Shutting down.

What is the reason for the "Actor not found" error?


// Yoonmin Nam




Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Jim Carroll
Hi Akhil,

Thanks! I guess in short that means the master (or slaves?) connect back to
the driver. This seems like a really odd way to work given the driver needs
to already connect to the master on port 7077. I would have thought that if
the driver could initiate a connection to the master, that would be all
that's required.

Can you describe what it is about the architecture that requires the master
to connect back to the driver even when the driver initiates a connection to
the master? Just curious.

Thanks anyway.
Jim
 






Re: Network requirements between Driver, Master, and Slave

2014-09-12 Thread Mayur Rustagi
The driver needs a consistent connection to the master in standalone mode, as a
whole bunch of client-side work happens on the driver. So calls like
parallelize send data from the driver to the master, and collect sends data
from the master to the driver.

If you are looking to avoid that connection, you can look into the embedded
driver model in YARN, where the driver will also run inside the cluster, and
hence reliability and connectivity are a given.
-- 
Regards,
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi

On Fri, Sep 12, 2014 at 6:46 PM, Jim Carroll jimfcarr...@gmail.com
wrote:

 Hi Akhil,
 Thanks! I guess in short that means the master (or slaves?) connect back to
 the driver. This seems like a really odd way to work given the driver needs
 to already connect to the master on port 7077. I would have thought that if
 the driver could initiate a connection to the master, that would be all
 that's required.
 Can you describe what it is about the architecture that requires the master
 to connect back to the driver even when the driver initiates a connection to
 the master? Just curious.
 Thanks anyway.
 Jim
  

Re: What is a pre built package of Apache Spark

2014-09-12 Thread andrew.craft
Hi,

I do not see any pre-built binaries on the site currently. I am using
make-distribution.sh to create a binary package. After that is done, the file
it generates will allow you to run the scripts in the bin folder.

HTH,
Andrew






Re: Error Driver disassociated while running the spark job

2014-09-12 Thread Akhil Das
What is your system setup? Can you paste the spark-env.sh? Looks like you
have some issues with your configuration.

Thanks
Best Regards

On Fri, Sep 12, 2014 at 6:31 PM, 남윤민 rony...@dgist.ac.kr wrote:

 I got this error from the executor's stderr:





 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 14/09/12 21:53:36 INFO CoarseGrainedExecutorBackend: Registered signal 
 handlers for [TERM, HUP, INT]
 14/09/12 21:53:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/09/12 21:53:36 INFO SecurityManager: Changing view acls to: root
 14/09/12 21:53:36 INFO SecurityManager: Changing modify acls to: root
 14/09/12 21:53:36 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(root); users 
 with modify permissions: Set(root)
 14/09/12 21:53:36 INFO Slf4jLogger: Slf4jLogger started
 14/09/12 21:53:36 INFO Remoting: Starting remoting
 14/09/12 21:53:37 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://driverPropsFetcher@saturn09:35376]
 14/09/12 21:53:37 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://driverPropsFetcher@saturn09:35376]
 14/09/12 21:53:37 INFO Utils: Successfully started service 
 'driverPropsFetcher' on port 35376.
 14/09/12 21:53:37 INFO SecurityManager: Changing view acls to: root
 14/09/12 21:53:37 INFO SecurityManager: Changing modify acls to: root
 14/09/12 21:53:37 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(root); users 
 with modify permissions: Set(root)
 14/09/12 21:53:37 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
 down remote daemon.
 14/09/12 21:53:37 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
 daemon shut down; proceeding with flushing remote transports.
 14/09/12 21:53:37 INFO Slf4jLogger: Slf4jLogger started
 14/09/12 21:53:37 INFO Remoting: Starting remoting
 14/09/12 21:53:37 INFO Remoting: Remoting shut down
 14/09/12 21:53:37 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
 shut down.
 14/09/12 21:53:37 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@saturn09:47076]
 14/09/12 21:53:37 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@saturn09:47076]
 14/09/12 21:53:37 INFO Utils: Successfully started service 'sparkExecutor' on 
 port 47076.
 14/09/12 21:53:37 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
 akka.tcp://sparkDriver@saturn00:49464/user/CoarseGrainedScheduler
 14/09/12 21:53:37 INFO WorkerWatcher: Connecting to worker 
 akka.tcp://sparkWorker@saturn09:43584/user/Worker
 14/09/12 21:53:37 INFO WorkerWatcher: Successfully connected to 
 akka.tcp://sparkWorker@saturn09:43584/user/Worker
 14/09/12 21:53:37 INFO CoarseGrainedExecutorBackend: Successfully registered 
 with driver
 14/09/12 21:53:37 INFO SecurityManager: Changing view acls to: root
 14/09/12 21:53:37 INFO SecurityManager: Changing modify acls to: root
 14/09/12 21:53:37 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(root); users 
 with modify permissions: Set(root)
 14/09/12 21:53:37 INFO Slf4jLogger: Slf4jLogger started
 14/09/12 21:53:37 INFO Remoting: Starting remoting
 14/09/12 21:53:37 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkExecutor@saturn09:34812]
 14/09/12 21:53:37 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkExecutor@saturn09:34812]
 14/09/12 21:53:37 INFO Utils: Successfully started service 'sparkExecutor' on 
 port 34812.
 14/09/12 21:53:37 INFO AkkaUtils: Connecting to MapOutputTracker: 
 akka.tcp://sparkDriver@saturn00:49464/user/MapOutputTracker
 14/09/12 21:53:37 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
 akka.tcp://sparkDriver@saturn00:49464/user/CoarseGrainedScheduler
 14/09/12 21:53:37 ERROR OneForOneStrategy: Actor not found for: 
 ActorSelection[Actor[akka.tcp://sparkDriver@saturn00:49464/]/user/MapOutputTracker]
 akka.actor.ActorNotFound: Actor not found for: 
 ActorSelection[Actor[akka.tcp://sparkDriver@saturn00:49464/]/user/MapOutputTracker]
   at 
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:66)
   at 
 akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:64)
   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
   at 
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
   at 
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
   at 
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
   at 
 akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
   at 
 

Re: spark sql - create new_table as select * from table

2014-09-12 Thread jamborta
Thanks, I will try it that way.






Fwd: Define the name of the outputs with Java-Spark.

2014-09-12 Thread Guillermo Ortiz
I would like to define the names of my output files in Spark. I have a process
which writes many files, and I would like to name them; is that possible? I
guess it's not possible with the saveAsTextFile method.

It would be something similar to MultipleOutputs in Hadoop.
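
One common approach (a sketch, not the only option): for a pair RDD, subclass
the old Hadoop API's MultipleTextOutputFormat so that the output file name is
derived from the key. The names and paths below are illustrative, and sc is
assumed to be an existing SparkContext.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Name each output file after its key.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name           // e.g. one directory per key, keeping part numbers
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()                  // don't write the key into the file contents
}

val pairs = sc.parallelize(Seq(("reportA", "line 1"), ("reportB", "line 2")))
pairs.saveAsHadoopFile(
  "/tmp/named-output",                  // placeholder output directory
  classOf[String], classOf[String],
  classOf[KeyBasedOutputFormat])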


How to initiate a shutdown of Spark Streaming context?

2014-09-12 Thread stanley
In  spark streaming programming document
https://spark.apache.org/docs/latest/streaming-programming-guide.html  ,
it specifically states how to shut down a spark streaming context: 

The existing application is shutdown gracefully (see
StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful
shutdown options) which ensure data that have been received is completely
processed before shutdown. 

However, my question is: how do I initiate a shutdown? Assuming I am upgrading
a running Spark Streaming system, how do I send a message to the running Spark
Streaming instance so that the StreamingContext.stop(...) call is made?

Thanks,

Stanley
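
One common pattern (a sketch under assumptions; the marker-file path and
intervals are made up, and sc is assumed to be an existing SparkContext): have
the driver watch for an external signal, such as a marker file, and call stop
with the graceful flag when it appears.

import java.io.File
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// ... set up input streams and output operations here ...
ssc.start()

// Instead of awaitTermination(), poll for an external "please stop" marker.
val marker = new File("/tmp/stop-streaming-app")   // placeholder path
var stopped = false
while (!stopped) {
  Thread.sleep(5000)
  if (marker.exists()) {
    // stopSparkContext = true, stopGracefully = true:
    // finish processing already-received data before shutting down.
    ssc.stop(true, true)
    stopped = true
  }
}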






Why I get java.lang.OutOfMemoryError: Java heap space with join ?

2014-09-12 Thread Jaonary Rabarisoa
Dear all,


I'm facing the following problem and I can't figure out how to solve it.

I need to join 2 RDDs in order to find their intersection. The first RDD
contains images encoded as base64 strings, associated with an image id. The
second RDD contains sets of geometric primitives (rectangles), also associated
with an image id. My goal is to draw these primitives on the corresponding
images, so my first attempt is to join images and primitives by image id and
then do the drawing.

But, when I do

*primitives.join(images) *


I got the following error :

*java.lang.OutOfMemoryError: Java heap space*
* at java.util.Arrays.copyOf(Arrays.java:2367)*
* at
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)*
* at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)*
* at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)*
* at java.lang.StringBuilder.append(StringBuilder.java:204)*
* at
java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3143)*
* at
java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3051)*
* at
java.io.ObjectInputStream$BlockDataInputStream.readLongUTF(ObjectInputStream.java:3034)*
* at java.io.ObjectInputStream.readString(ObjectInputStream.java:1642)*
* at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1341)*
* at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)*
* at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)*
* at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)*
* at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)*
* at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)*
* at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)*
* at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)*
* at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)*
* at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)*
* at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)*
* at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)*
* at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)*
* at
org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1031)*
* at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)*
* at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)*
* at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)*
* at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)*
* at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)*
* at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)*
* at scala.collection.Iterator$class.foreach(Iterator.scala:727)*
* at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)*
* at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)*

I notice that sometimes, if I change the partitioning of the images RDD with
coalesce, I can get it working.

What am I doing wrong?

Cheers,

Jaonary
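
Two things that often help in this situation, sketched under the assumption
that the primitives RDD is much smaller than the images RDD and that its
grouped form fits in driver memory (the partition count is an arbitrary
example):

// 1) Give the shuffle more, smaller partitions so each task deserialises less at once.
val joined = primitives.join(images, 200)

// 2) Or avoid shuffling the large base64 strings entirely: broadcast the small side.
val primitivesByImage = sc.broadcast(primitives.groupByKey().collectAsMap())
val withPrimitives = images.map { case (imageId, imageBase64) =>
  val prims = primitivesByImage.value.getOrElse(imageId, Iterable.empty)
  (imageId, (imageBase64, prims))   // decode the image and draw the primitives here
}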


Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Brad Miller
Hi Davies,

Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release.  I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.

For the workaround, I think you may have meant:

srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Note:
_schema_rdd - _jschema_rdd
false - False

That workaround seems to work fine (in that I've observed the correct
number of partitions in the web-ui, although haven't tested it any beyond
that).

Thanks!
-Brad

On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu dav...@databricks.com wrote:

 This is a bug, I had create an issue to track this:
 https://issues.apache.org/jira/browse/SPARK-3500

 Also, there is PR to fix this: https://github.com/apache/spark/pull/2369

 Before next bugfix release, you can workaround this by:

 srdd = sqlCtx.jsonRDD(rdd)
 srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


 On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu
 wrote:
  Hi All,
 
  I'm having some trouble with the coalesce and repartition functions for
  SchemaRDD objects in pyspark.  When I run:
 
  sqlCtx.jsonRDD(sc.parallelize(['{foo:bar}',
  '{foo:baz}'])).coalesce(1)
 
  I get this error:
 
  Py4JError: An error occurred while calling o94.coalesce. Trace:
  py4j.Py4JException: Method coalesce([class java.lang.Integer, class
  java.lang.Boolean]) does not exist
 
  For context, I have a dataset stored in a parquet file, and I'm using
  SQLContext to make several queries against the data.  I then register the
  results of these as queries new tables in the SQLContext.  Unfortunately
  each new table has the same number of partitions as the original (despite
  being much smaller).  Hence my interest in coalesce and repartition.
 
  Has anybody else encountered this bug?  Is there an alternate workflow I
  should consider?
 
  I am running the 1.1.0 binaries released today.
 
  best,
  -Brad



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Patrick Wendell
[moving to user@]

This would typically be accomplished with a union() operation. You
can't mutate an RDD in-place, but you can create a new RDD with a
union() which is an inexpensive operator.
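
A minimal sketch of that union-based pattern for a plain RDD (the names are
illustrative, and sc is assumed to be an existing SparkContext; for SchemaRDDs
I believe the analogous operator is unionAll):

import org.apache.spark.rdd.RDD

// Running view of all batches seen so far, kept cached.
var allEvents: RDD[String] = sc.parallelize(Seq.empty[String])
allEvents.cache()

def addBatch(newBatch: RDD[String]): Unit = {
  val updated = allEvents.union(newBatch).cache()
  updated.count()          // materialise the new cached RDD
  allEvents.unpersist()    // release the previous cached copy
  allEvents = updated
}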

On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
archit279tha...@gmail.com wrote:
 Hi,

 We have a use case where we are planning to keep sparkcontext alive in a
  server and run queries on it. But the issue is that we have a continuous
  flow of data that comes in batches of constant duration (say, 1 hour). Now we
 want to exploit the schemaRDD and its benefits of columnar caching and
 compression. Is there a way I can append the new batch (uncached) to the
 older(cached) batch without losing the older data from cache and caching
 the whole dataset.

 Thanks and Regards,


 Archit Thakur.
 Sr Software Developer,
 Guavus, Inc.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Nested Case Classes (Found and Required Same)

2014-09-12 Thread iramaraju
I think this is a common issue, but I need help figuring out a workaround if
it is still unresolved. I have a dataset that has more than 70 columns. To have
all the columns fit into my RDD, I am experimenting with the following. (I
intend to use InputData to parse the file, with 3 or 4 column sets to
accommodate the full list of variables.)

case class ColumnSet(C1: Double , C2: Double , C3: Double)
case class InputData(EQN: String, ts: String,Set1 :ColumnSet,Set2
:ColumnSet)

val set1 = ColumnSet(1, 2, 3)
val a = InputData("a", "a", set1, set1)

returns the following

<console>:16: error: type mismatch;
 found   : ColumnSet
 required: ColumnSet
       val a = InputData("a", "a", set1, set1)

Whereas the same code works fine in my Scala console.

Is there a work around for my problem ?

Regards
Ram



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case-Classes-Found-and-Required-Same-tp14096.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Nested Case Classes (Found and Required Same)

2014-09-12 Thread Ramaraju Indukuri
This is only a problem in the shell; it works fine in batch mode. I am
also interested in how others work around the case class limitation on the
number of fields.

Regards
Ram

On Fri, Sep 12, 2014 at 12:12 PM, iramaraju iramar...@gmail.com wrote:

 I think this is a popular issue, but need help figuring a way around if
 this
 issue is unresolved. I have a dataset that has more than 70 columns. To
 have
 all the columns fit into my RDD, I am experimenting the following. (I
 intend
 to use the InputData to parse the file and have 3 or 4 columnsets to
 accommodate the full list of variables)

 case class ColumnSet(C1: Double , C2: Double , C3: Double)
 case class InputData(EQN: String, ts: String,Set1 :ColumnSet,Set2
 :ColumnSet)

 val  set1 = ColumnSet(1,2,3)
 val a = InputData(a,a,set1,set1)

 returns the following

 console:16: error: type mismatch;
  found   : ColumnSet
  required: ColumnSet
val a = InputData(a,a,set1,set1)

 Where as the same code works fine in my scala console.

 Is there a work around for my problem ?

 Regards
 Ram



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case-Classes-Found-and-Required-Same-tp14096.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
--
Ramaraju Indukuri


slides from df talk at global big data conference

2014-09-12 Thread Mohit Jaggi
http://engineering.ayasdi.com/2014/09/11/df-dataframes-on-spark/


Re: Nested Case Classes (Found and Required Same)

2014-09-12 Thread Prashant Sharma
What is your Spark version? This was fixed, I suppose. Can you try it with
the latest release?

Prashant Sharma



On Fri, Sep 12, 2014 at 9:47 PM, Ramaraju Indukuri iramar...@gmail.com
wrote:

 This is only a problem in shell, but works fine in batch mode though. I am
 also interested in how others are solving the problem of case class
 limitation on number of variables.

 Regards
 Ram

 On Fri, Sep 12, 2014 at 12:12 PM, iramaraju iramar...@gmail.com wrote:

 I think this is a popular issue, but need help figuring a way around if
 this
 issue is unresolved. I have a dataset that has more than 70 columns. To
 have
 all the columns fit into my RDD, I am experimenting the following. (I
 intend
 to use the InputData to parse the file and have 3 or 4 columnsets to
 accommodate the full list of variables)

 case class ColumnSet(C1: Double , C2: Double , C3: Double)
 case class InputData(EQN: String, ts: String,Set1 :ColumnSet,Set2
 :ColumnSet)

 val  set1 = ColumnSet(1,2,3)
 val a = InputData(a,a,set1,set1)

 returns the following

 console:16: error: type mismatch;
  found   : ColumnSet
  required: ColumnSet
val a = InputData(a,a,set1,set1)

 Where as the same code works fine in my scala console.

 Is there a work around for my problem ?

 Regards
 Ram



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Nested-Case-Classes-Found-and-Required-Same-tp14096.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 --
 Ramaraju Indukuri



Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
So, after toying around a bit, here's what I ended up with. First off,
there's no function registerTempTable -- registerTable seems to be
enough to work (it's the same whether called directly on a SchemaRDD or on a
SQLContext that is passed an RDD). The problem I encountered after that was
reloading a table in one actor and referencing it in another.

The environment I had set up has 2 types of Akka actors, a Query and a
Refresher. They share a reference (passed in on creation via
Props(classOf[Actor], sqlContext)). The Refresher would simply reload the
parquet file and refresh the table:

sqlContext
  .parquetFile(dataDir)
  .registerAsTable(tableName)

The WebService would query it:

sqlContext.sql(query with tableName).collect()

This would break: the Refresher actor would work and be able to query, but
the Query actor would report that the table doesn't exist.


I now removed the Refresher and just updated the Query actor to refresh its
table if it's stale.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-running-parquet-tables-tp13987p14102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: coalesce on SchemaRDD in pyspark

2014-09-12 Thread Davies Liu
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller bmill...@eecs.berkeley.edu wrote:
 Hi Davies,

 Thanks for the quick fix. I'm sorry to send out a bug report on release day
 - 1.1.0 really is a great release.  I've been running the 1.1 branch for a
 while and there's definitely lots of good stuff.

 For the workaround, I think you may have meant:

 srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)

Yes, thanks for the correction.

 Note:
 _schema_rdd - _jschema_rdd
 false - False

 That workaround seems to work fine (in that I've observed the correct number
 of partitions in the web-ui, although haven't tested it any beyond that).

 Thanks!
 -Brad

 On Thu, Sep 11, 2014 at 11:30 PM, Davies Liu dav...@databricks.com wrote:

 This is a bug; I have created an issue to track it:
 https://issues.apache.org/jira/browse/SPARK-3500

 Also, there is PR to fix this: https://github.com/apache/spark/pull/2369

 Before the next bugfix release, you can work around it like this:

 srdd = sqlCtx.jsonRDD(rdd)
 srdd2 = SchemaRDD(srdd._schema_rdd.coalesce(N, false, None), sqlCtx)


 On Thu, Sep 11, 2014 at 6:12 PM, Brad Miller bmill...@eecs.berkeley.edu
 wrote:
  Hi All,
 
  I'm having some trouble with the coalesce and repartition functions for
  SchemaRDD objects in pyspark.  When I run:
 
  sqlCtx.jsonRDD(sc.parallelize(['{foo:bar}',
  '{foo:baz}'])).coalesce(1)
 
  I get this error:
 
  Py4JError: An error occurred while calling o94.coalesce. Trace:
  py4j.Py4JException: Method coalesce([class java.lang.Integer, class
  java.lang.Boolean]) does not exist
 
  For context, I have a dataset stored in a parquet file, and I'm using
  SQLContext to make several queries against the data.  I then register
  the
  results of these as queries new tables in the SQLContext.  Unfortunately
  each new table has the same number of partitions as the original
  (despite
  being much smaller).  Hence my interest in coalesce and repartition.
 
  Has anybody else encountered this bug?  Is there an alternate workflow I
  should consider?
 
  I am running the 1.1.0 binaries released today.
 
  best,
  -Brad



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark and Scala

2014-09-12 Thread Deep Pradhan
There is one thing that I am confused about.
Spark itself is implemented in Scala. Now, can we run any Scala code on the
Spark framework? What will be the difference in the execution of the Scala
code on normal systems and on Spark?
The reason for my question is the following:
I had a variable
*val temp = some operations*
This temp was being created inside a loop. To manually throw it out of the
cache every time the loop ends, I was calling *temp.unpersist()*, but this
returned an error saying that *value unpersist is not a method of Int*,
which means that temp is an Int.
Can someone explain to me why I was not able to call *unpersist* on *temp*?

Thank You


Re: Spark and Scala

2014-09-12 Thread Nicholas Chammas
unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
that doesn't change in Spark.

On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You



Stable spark streaming app

2014-09-12 Thread Tim Smith
Hi,

Anyone have a stable streaming app running in production? Can you
share some overview of the app and setup like number of nodes, events
per second, broad stream processing workflow, config highlights etc?

Thanks,

Tim

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: split a RDD by pencetage

2014-09-12 Thread pankaj.arora
You can use mapPartitions to achieve this.

import scala.collection.mutable.ArrayBuffer

// split each partition into 10 equal parts, with each part tagged by its id
val splitRDD = self.mapPartitions { itr =>
  // iterate over this iterator and break it into 10 parts
  val parts = Array.fill(10)(ArrayBuffer.empty[T])
  var i = 0
  for (tuple <- itr) {
    parts(i % 10) += tuple
    i += 1
  }
  parts.iterator.zipWithIndex.map { case (part, id) => (id, part) }
}

// filter the RDD for each part created above and flatMap to get an array of RDDs
val rddArray = (0 until 10).toArray.map(i => splitRDD.filter(_._1 == i).flatMap(_._2))

The code was not written in an IDE, so it may need small modifications.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/split-a-RDD-by-pencetage-tp333p14106.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj.arora
Hi Patrick,

What if all the data has to be kept in cache at all times? If applying union
results in a new RDD, then caching it would keep both the older data and the
new data in memory, hence duplicating data.

Below is what I understood from your comment:

sqlContext.cacheTable("existingRDD") // caches the RDD; the schema RDD uses
columnar compression

existingRDD.union(newRDD).registerAsTable("newTable")

sqlContext.cacheTable("newTable") -- duplicated data



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Tim Smith
Similar issue (Spark 1.0.0). Streaming app runs for a few seconds
before these errors start to pop all over the driver logs:

14/09/12 17:30:23 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block
input-4-1410542878200 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.UnionPartition.iterator(UnionRDD.scala:33)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:74)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

I am using MEMORY_AND_DISK_SER for all my RDDs so I should not be
losing any blocks unless I run out of disk space, right?



On Fri, Sep 12, 2014 at 5:24 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
 I agree,

 Even the low-level Kafka consumer which I have written has tunable IO
 throttling, which helped me solve this issue... But the question remains: even
 if there is a large backlog, why does Spark drop the unprocessed memory blocks?

 Dib

 On Fri, Sep 12, 2014 at 5:47 PM, Jeoffrey Lim jeoffr...@gmail.com wrote:

 Our issue could be related to this problem as described in:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html
 which the DStream is processed for every 1 hour batch duration.

 I have implemented IO throttling in the Receiver as well as in our Kafka
 consumer, and our backlog is not that large.

 INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for
 dropping
 INFO : org.apache.spark.storage.BlockManager - Dropping block
 input-0-1410443074600 from memory
 INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600
 of size 12651900 dropped from memory (free 21220667)
 INFO : org.apache.spark.storage.BlockManagerInfo - Removed
 input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 in memory
 (size: 12.1 MB, free: 100.6 MB)

 The question that I have now is: how do we prevent the
 MemoryStore/BlockManager from dropping the block inputs? And should they be
 logged at the WARN/ERROR level?


 Thanks.


 On Fri, Sep 12, 2014 at 4:45 PM, Dibyendu Bhattacharya [via Apache Spark
 User List] [hidden email] wrote:

 Dear all,

 I am sorry. This was a false alarm.

 There was an issue in the RDD processing logic which led to a large backlog.
 Once I fixed the issues in my processing logic, I can see all messages being
 pulled nicely without any Block Removed error. I need to tune certain
 configurations in my Kafka consumer to adjust the data rate and also the
 batch size.

 Sorry again.


 Regards,
 Dibyendu

 On Thu, Sep 11, 2014 at 8:13 PM, Nan Zhu [hidden email] wrote:

 This is my case about broadcast variable:

 14/07/21 19:49:13 INFO Executor: Running task ID 4
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2)
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on
 localhost (progress: 3/106)
 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for
 hdfstest_customers
 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596
 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO Executor: Finished task ID 3
 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on
 executor localhost: localhost (PROCESS_LOCAL)
 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885
 bytes in 0 ms
 14/07/21 19:49:13 INFO Executor: Running task ID 5
 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0
 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3)
 14/07/21 19:49:13 INFO ContextCleaner: Cleaned broadcast 0
 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on
 localhost (progress: 4/106)
 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally
 14/07/21 19:49:13 INFO 

Re: Unable to ship external Python libraries in PYSPARK

2014-09-12 Thread Davies Liu
With SparkContext.addPyFile(xx.zip), the xx.zip will be copied to all
the workers and stored in a temporary directory, and the path to xx.zip will be
added to sys.path on the worker machines, so you can import xx in your jobs; it
does not need to be installed on the worker machines.

PS: the package or module should be at the top level of xx.zip, or it cannot
be imported. For example:

daviesliu@dm:~/work/tmp$ zipinfo textblob.zip
Archive:  textblob.zip   3245946 bytes   517 files
drwxr-xr-x  3.0 unx0 bx stor 12-Sep-14 10:10 textblob/
-rw-r--r--  3.0 unx  203 tx defN 12-Sep-14 10:10 textblob/__init__.py
-rw-r--r--  3.0 unx  563 bx defN 12-Sep-14 10:10 textblob/__init__.pyc
-rw-r--r--  3.0 unx61510 tx defN 12-Sep-14 10:10 textblob/_text.py
-rw-r--r--  3.0 unx68316 bx defN 12-Sep-14 10:10 textblob/_text.pyc
-rw-r--r--  3.0 unx 2962 tx defN 12-Sep-14 10:10 textblob/base.py
-rw-r--r--  3.0 unx 5501 bx defN 12-Sep-14 10:10 textblob/base.pyc
-rw-r--r--  3.0 unx27621 tx defN 12-Sep-14 10:10 textblob/blob.py

you can get this textblob.zip by:

pip install textblob
cd /xxx/xx/site-package/
zip -r path_to_store/textblob.zip textblob
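
As a hedged end-to-end sketch (the zip path, app name, and sample data are made
up for illustration), shipping the zip and importing it inside a task looks
roughly like this:

from pyspark import SparkContext

sc = SparkContext(appName="TextBlobExample")
sc.addPyFile("textblob.zip")  # copied to every worker and added to sys.path there

def polarity(line):
    # import on the worker; resolved from the shipped zip
    from textblob import TextBlob
    return TextBlob(line).sentiment.polarity

print(sc.parallelize(["Spark is great", "this is terrible"]).map(polarity).collect())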

Davies


On Fri, Sep 12, 2014 at 1:39 AM, yh18190 yh18...@gmail.com wrote:
 Hi all,

 I am currently working with PySpark for NLP processing, using the TextBlob
 Python library. In standalone mode it is easy to install external Python
 libraries, but in cluster mode I am facing problems installing these
 libraries on the worker nodes remotely. I cannot access each and every
 worker machine to install these libs in the Python path. I tried to use
 the SparkContext pyFiles option to ship .zip files, but the problem is that
 these Python packages need to be installed on the worker machines. Could
 anyone let me know what the different ways of doing this are, so that the
 TextBlob library is available in the Python path?



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
You can always use sqlContext.uncacheTable to uncache the old table.

On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora pankajarora.n...@gmail.com
wrote:

 Hi Patrick,

 What if all the data has to be keep in cache all time. If applying union
 result in new RDD then caching this would result into keeping older as well
 as this into memory hence duplicating data.

 Below is what i understood from your comment.

 sqlContext.cacheTable(existingRDD)// caches the RDD as schema RDD uses
 columnar compression

 existingRDD.union(newRDD).registerAsTable(newTable)

 sqlContext.cacheTable(newTable) -- duplicated data



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14107.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL and running parquet tables?

2014-09-12 Thread DanteSama
Turns out it was Spray with a bad route -- the results weren't updating
despite the table running. This thread can be ignored.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-running-parquet-tables-tp13987p14114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EOFException when reading from HDFS

2014-09-12 Thread kent
Can anyone help me with this?  I have been stuck on this for a few days and
don't know what to try anymore.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/EOFException-when-reading-from-HDFS-tp13844p14115.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Yarn Over-allocating Containers

2014-09-12 Thread Sandy Ryza
Hi Praveen,

I believe you are correct.  I noticed this a little while ago and had a fix
for it as part of SPARK-1714, but that's been delayed.  I'll look into this
a little deeper and file a JIRA.

-Sandy

On Thu, Sep 11, 2014 at 11:44 PM, praveen seluka praveen.sel...@gmail.com
wrote:

 Hi all

 I am seeing a strange issue in Spark on YARN (stable). Let me know if this is
 a known issue, or if I am missing something, as it looks very fundamental.

 I launch a Spark job with 2 containers. addContainerRequest is called twice,
 and then allocate is called on the AMRMClient. This gets 2 containers
 allocated. Fine so far.

 The reporter thread starts. Now, if 1 of the containers dies, this is what
 happens: the reporter thread adds another addContainerRequest, and the next
 allocate *actually* gets back 3 containers (the total number of container
 requests since the beginning). The reporter thread has a check to discard
 (release) excess containers and ends up releasing 2.

 In summary, the job starts with 2 containers, 1 dies (let's say), the reporter
 thread adds 1 more container request, subsequently gets back 3 allocated
 containers (from YARN), and discards 2 as it needed just 1.

 Thanks
 Praveen



EOFException when reading from HDFS

2014-09-12 Thread kents
I just started playing with Spark. So I ran the SimpleApp program from
tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), which works
fine.

However, if I change the file location from local to hdfs, then I get an
EOFException.

I did some searching online, which suggests this error is caused by Hadoop
version conflicts. I made the suggested modification in my sbt file, but I
still get the same error.

I am using CDH5.1, code and full error log is below. Any help is greatly
appreciated.

Thanks



Scala:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {

    val logFile = "hdfs://plogs001.sjc.domain.com:8020/tmp/data.txt" // Should be some file on your system
    val conf = new SparkConf()
      .setMaster("spark://plogs004.sjc.domain.com:7077")
      .setAppName("SimpleApp")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)

    //val logFile = "/tmp/data.txt" // Should be some file on your system
    //val conf = new SparkConf().setAppName("Simple Application")
    //val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}



SBT:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-cdh5.1.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"



Error Log:
[hdfs@plogs001 test1]$ spark-submit --class SimpleApp --master
spark://sp...@plogs004.sjc.domain.com:7077
target/scala-2.10/simple-project_2.10-1.0.jar 
14/09/09 16:56:41 INFO spark.SecurityManager: Changing view acls to: hdfs 
14/09/09 16:56:41 INFO spark.SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(hdfs) 
14/09/09 16:56:41 INFO slf4j.Slf4jLogger: Slf4jLogger started 
14/09/09 16:56:41 INFO Remoting: Starting remoting 
14/09/09 16:56:41 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sp...@plogs001.sjc.domain.com:34607] 
14/09/09 16:56:41 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sp...@plogs001.sjc.domain.com:34607] 
14/09/09 16:56:41 INFO spark.SparkEnv: Registering MapOutputTracker 
14/09/09 16:56:41 INFO spark.SparkEnv: Registering BlockManagerMaster 
14/09/09 16:56:41 INFO storage.DiskBlockManager: Created local directory at
/tmp/spark-local-20140909165641-375e 
14/09/09 16:56:41 INFO storage.MemoryStore: MemoryStore started with
capacity 294.9 MB. 
14/09/09 16:56:41 INFO network.ConnectionManager: Bound socket to port 40833
with id = ConnectionManagerId(plogs001.sjc.domain.com,40833) 
14/09/09 16:56:41 INFO storage.BlockManagerMaster: Trying to register
BlockManager 
14/09/09 16:56:41 INFO storage.BlockManagerInfo: Registering block manager
plogs001.sjc.domain.com:40833 with 294.9 MB RAM 
14/09/09 16:56:41 INFO storage.BlockManagerMaster: Registered BlockManager 
14/09/09 16:56:41 INFO spark.HttpServer: Starting HTTP Server 
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/09/09 16:56:42 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:47419 
14/09/09 16:56:42 INFO broadcast.HttpBroadcast: Broadcast server started at
http://172.16.30.161:47419
14/09/09 16:56:42 INFO spark.HttpFileServer: HTTP File server directory is
/tmp/spark-7026d0b6-777e-4dd3-9bbb-e79d7487e7d7 
14/09/09 16:56:42 INFO spark.HttpServer: Starting HTTP Server 
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/09/09 16:56:42 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:42388 
14/09/09 16:56:42 INFO server.Server: jetty-8.y.z-SNAPSHOT 
14/09/09 16:56:42 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040 
14/09/09 16:56:42 INFO ui.SparkUI: Started SparkUI at
http://plogs001.sjc.domain.com:4040
14/09/09 16:56:42 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable 
14/09/09 16:56:42 INFO spark.SparkContext: Added JAR
file:/home/hdfs/kent/test1/target/scala-2.10/simple-project_2.10-1.0.jar at
http://172.16.30.161:42388/jars/simple-project_2.10-1.0.jar with timestamp
1410307002737 
14/09/09 16:56:42 INFO client.AppClient$ClientActor: Connecting to master
spark://plogs004.sjc.domain.com:7077... 
14/09/09 16:56:42 INFO storage.MemoryStore: ensureFreeSpace(155704) called
with curMem=0, maxMem=309225062 
14/09/09 16:56:42 INFO storage.MemoryStore: Block broadcast_0 stored as
values to memory (estimated size 152.1 KB, free 294.8 MB) 
14/09/09 16:56:42 INFO cluster.SparkDeploySchedulerBackend: Connected 

Re: Configuring Spark for heterogenous hardware

2014-09-12 Thread Victor Tso-Guillen
Ping...

On Thu, Sep 11, 2014 at 5:44 PM, Victor Tso-Guillen v...@paxata.com wrote:

 So I have a bunch of hardware with different core and memory setups. Is
 there a way to do one of the following:

 1. Express a ratio of cores to memory to retain. The spark worker config
 would represent all of the cores and all of the memory usable for any
 application, and the application would take a fraction that sustains the
 ratio. Say I have 4 cores and 20G of RAM. I'd like it to have the worker
 take 4/20 and the executor take 5 G for each of the 4 cores, thus maxing
 both out. If there were only 16G with the same ratio requirement, it would
 only take 3 cores and 12G in a single executor and leave the rest.

 2. Have the executor take whole number ratios of what it needs. Say it is
 configured for 2/8G and the worker has 4/20. So we can give the executor
 2/8G (which is true now) or we can instead give it 4/16G, maxing out one of
 the two parameters.

  Either way would allow me to get my heterogeneous hardware all
  participating in the work of my Spark cluster, presumably without
  endangering Spark's assumption of homogeneous execution environments in the
  dimensions of memory and cores. If there's any way to do this, please
  enlighten me.



Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Nicholas Chammas
Andrew,

This email was pretty helpful. I feel like this stuff should be summarized
in the docs somewhere, or perhaps in a blog post.

Do you know if it is?

Nick


On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:

 The locality is how close the data is to the code that's processing it.
  PROCESS_LOCAL means data is in the same JVM as the code that's running, so
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the
 same node, or in another executor on the same node, so is a little slower
 because the data has to travel across an IPC connection.  RACK_LOCAL is
 even slower -- data is on a different server so needs to be sent over the
 network.

 Spark switches to lower locality levels when there's no unprocessed data
 on a node that has idle CPUs.  In that situation you have two options: wait
 until the busy CPUs free up so you can start another task that uses data on
 that server, or start a new task on a farther away server that needs to
 bring data from that remote place.  What Spark typically does is wait a bit
 in the hopes that a busy CPU frees up.  Once that timeout expires, it
 starts moving the data from far away to the free CPU.

 The main tunable option is how far long the scheduler waits before
 starting to move data rather than code.  Those are the spark.locality.*
 settings here: http://spark.apache.org/docs/latest/configuration.html

 If you want to prevent this from happening entirely, you can set the
 values to ridiculously high numbers.  The documentation also mentions that
 0 has special meaning, so you can try that as well.
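
 A quick, hedged sketch of setting these (the values are made up; in Spark 1.x
 they are given in milliseconds, with a default of 3000):

 from pyspark import SparkConf, SparkContext

 conf = (SparkConf()
         .setAppName("LocalityTuning")
         .set("spark.locality.wait", "10000")        # wait longer before dropping locality
         .set("spark.locality.wait.node", "10000"))  # per-level overrides also exist
 sc = SparkContext(conf=conf)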

 Good luck!
 Andrew


 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu
 wrote:

 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd
 assume that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.

 When these happen things get extremely slow.

 Does this mean that the executor got terminated and restarted?

 Is there a way to prevent this from happening (barring the machine
 actually going down, I'd rather stick with the same process)?





Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Archit Thakur
A little code snippet:

line1: cacheTable("existingRDDTableName")
line2: // some operations which will materialize the existingRDD dataset
line3: existingRDD.union(newRDD).registerAsTable("new_existingRDDTableName")
line4: cacheTable("new_existingRDDTableName")
line5: // some operation that will materialize the new existingRDD

Now, what we expect at line4 is that rather than caching both
existingRDDTableName and new_existingRDDTableName, it should cache only
new_existingRDDTableName. But we cannot explicitly uncache
existingRDDTableName, because we want the union to use the cached
existingRDDTableName. Since evaluation is lazy, new_existingRDDTableName could
be materialized later, and until then we cannot lose existingRDDTableName from
the cache.

What if we keep the same name for the new table?

so, cacheTable("existingRDDTableName")
existingRDD.union(newRDD).registerAsTable("existingRDDTableName")
cacheTable("existingRDDTableName") // might not be needed again

Would both of our requirements be satisfied then: the union uses
existingRDDTableName from the cache, and we don't duplicate the data in the
cache but somehow append to the older cached table?

Thanks and Regards,


Archit Thakur.
Sr Software Developer,
Guavus, Inc.

On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora pankajarora.n...@gmail.com
wrote:

 I think I should elaborate on the use case a little more.

 We have a UI dashboard whose response time is quite fast because all the data
 is cached. Users query data based on a time range, and there is always new
 data coming into the system at a predefined frequency, let's say 1 hour.

 As you said, I can uncache tables, but that will basically drop all the data
 from memory. I cannot afford to lose my cache even for a short interval, as
 all queries from the UI will get slow until the cache loads again. The UI
 response time needs to be predictable and should be fast enough that the user
 does not get irritated.

 Also, I cannot keep two copies of the data (until the new RDD materializes) in
 memory, as it would surpass the total available memory in the system.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread SK
Hi,

I am using the Spark 1.1.0 version that was released yesterday. I recompiled
my program to use the latest version using sbt assembly after modifying
project.sbt to use the 1.1.0 version. The compilation goes through and the
jar is built. When I run the jar using spark-submit, I get an error: Cannot
load main class from JAR. This program was working with version 1.0.2. The
class does have a main method. So far I have never had problems recompiling
and running the jar, when I have upgraded to new versions. Is there anything
different I need to do for 1.1.0 ?

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load-main-class-from-JAR-tp14123.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread SK
This issue is resolved. Looks like in the new spark-submit, the jar path has
to be at the end of the options. Earlier I could specify this path in any
order on the command line. 

thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load-main-class-from-JAR-tp14123p14124.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL hang due to PERFLOG method=acquireReadWriteLocks

2014-09-12 Thread Michael Armbrust
What is in your hive-site.xml?

On Thu, Sep 11, 2014 at 11:04 PM, linkpatrickliu linkpatrick...@live.com
wrote:

 I am running Spark Standalone mode with Spark 1.1

 I started SparkSQL thrift server as follows:
 ./sbin/start-thriftserver.sh

 Then I use beeline to connect to it.
 Now, I can CREATE, SELECT, SHOW the databases or the tables;
 But when I DROP or Load data inpath 'kv1.txt' into table src, the
 Beeline client will hang.

 Here is the log of thriftServer:

 14/09/12 13:59:41 INFO Driver: /PERFLOG method=doAuthorization
 start=1410501581524 end=1410501581549 duration=25
 14/09/12 13:59:41 INFO Driver: /PERFLOG method=compile start=1410501581500
 end=1410501581549 duration=49
 14/09/12 13:59:41 INFO Driver: PERFLOG method=acquireReadWriteLocks

 Anyone can help on this? Many thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-hang-due-to-PERFLOG-method-acquireReadWriteLocks-tp14055.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
Ah, I see. So basically what you need is something like cache write-through
support, which exists in Shark but is not implemented in Spark SQL yet. In
Shark, when inserting data into a table that has already been cached, the
newly inserted data will be automatically cached and “union”-ed with the
existing table content. SPARK-1671
https://issues.apache.org/jira/browse/SPARK-1671 was created to track
this feature. We’ll work on that.

Currently, as a workaround, instead of doing the union at the RDD level, you
may try caching the new table, unioning it with the old table, and then
querying the union-ed table. The drawback is higher code complexity and you
end up with lots of temporary tables, but the performance should be reasonable.
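
A rough PySpark sketch of that workaround, assuming an existing SQLContext
sqlCtx, that the Python SQLContext exposes cacheTable like the Scala one, and
that both batches share the same schema (all file and table names below are
made up):

srdd_old = sqlCtx.parquetFile("batch_old.parquet")
srdd_old.registerAsTable("batch_old")
sqlCtx.cacheTable("batch_old")

srdd_new = sqlCtx.parquetFile("batch_new.parquet")
srdd_new.registerAsTable("batch_new")
sqlCtx.cacheTable("batch_new")

# Query the union of the cached batches instead of mutating either one.
result = sqlCtx.sql("SELECT * FROM batch_old UNION ALL SELECT * FROM batch_new")
result.collect()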

On Fri, Sep 12, 2014 at 1:19 PM, Archit Thakur archit279tha...@gmail.com
wrote:

 LittleCode snippet:

 line1: cacheTable(existingRDDTableName)
 line2: //some operations which will materialize existingRDD dataset.
 line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
 line4: cacheTable(new_existingRDDTableName)
 line5: //some operation that will materialize new _existingRDD.

 now, what we expect is in line4 rather than caching both
 existingRDDTableName and new_existingRDDTableName, it should cache only
 new_existingRDDTableName. but we cannot explicitly uncache
 existingRDDTableName because we want the union to use the cached
 existingRDDTableName. since being lazy new_existingRDDTableName could be
 materialized later and by then we cant lose existingRDDTableName from
 cache.

 What if keep the same name of the new table

 so, cacheTable(existingRDDTableName)
 existingRDD.union(newRDD).registerAsTable(existingRDDTableName)
 cacheTable(existingRDDTableName) //might not be needed again.

 Will our both cases be satisfied, that it uses existingRDDTableName from
 cache for union and dont duplicate the data in the cache but somehow,
 append to the older cacheTable.

 Thanks and Regards,


 Archit Thakur.
 Sr Software Developer,
 Guavus, Inc.

 On Sat, Sep 13, 2014 at 12:01 AM, pankaj arora pankajarora.n...@gmail.com
  wrote:

 I think i should elaborate usecase little more.

 So we have UI dashboard whose response time is quite fast as all the data
 is
 cached. Users query data based on time range and also there is always new
 data coming into the system at predefined frequency lets say 1 hour.

 As you said i can uncache tables it will basically drop all data from
 memory.
 I cannot afford losing my cache even for short interval. As all queries
 from
 UI will get slow till the time cache loads again. UI response time needs
 to
 be predictable and shoudl be fast enough so that user does not get
 irritated.

 Also i cannot keep two copies of data(till newrdd materialize) into memory
 as it will surpass total available memory in system.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Re-Use-Case-of-mutable-RDD-any-ideas-around-will-help-tp14095p14112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Where do logs go in StandAlone mode

2014-09-12 Thread Tim Smith
Spark 1.0.0

I write logs out from my app using this object:

object LogService extends Logging {

  /** Set reasonable logging levels for streaming if the user has not
      configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging,
      // then we override the logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}

Later, I call LogService.setStreamingLogLevels() and then use logInfo etc.

This works well when I run the app under YARN: all the logs show up under the
container logs. But when I run the app in standalone mode, I can't find these
logs in the master, worker, or driver logs. So where do they go?

Thanks,

Tim

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
I am a newbie to Java, so please be specific on how to resolve this.

The command I was running is

$ ./spark-submit --driver-class-path
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py
 
quickstart.cloudera data1

14/09/12 14:12:07 WARN spark.SparkConf: Setting
'spark.executor.extraClassPath' to
':/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar:/usr/lib/hbase/hbase-protocol-0.98.1-cdh5.1.0.jar'
as a work-around.
Traceback (most recent call last):
  File
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/examples/src/main/python/hbase_inputformat.py,
line 61, in module
sc = SparkContext(appName=HBaseInputFormat)
  File
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py,
line 107, in __init__
conf)
  File
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py,
line 155, in _do_init
self._jsc = self._initialize_context(self._conf._jconf)
  File
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/python/pyspark/context.py,
line 201, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
  File
/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/java_gateway.py,
line 701, in __call__
self._fqn)
  File
/usr/lib/python2.6/site-packages/py4j-0.8.2.1-py2.6.egg/py4j/protocol.py,
line 300, in get_return_value
format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling
None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.spark.SparkException: Found both spark.driver.extraClassPath
and SPARK_CLASSPATH. Use only the former.
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
at
org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
at org.apache.spark.SparkContext.init(SparkContext.scala:158)
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:53)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-failure-class-conflict-tp14127.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Thrift JDBC server deployment for production

2014-09-12 Thread Michael Armbrust
Something like the following should let you launch the thrift server on
yarn.


HADOOP_CONF_DIR=/etc/hadoop/conf HIVE_SERVER2_THRIFT_PORT=12345 MASTER=yarn-
client ./sbin/start-thriftserver.sh


On Thu, Sep 11, 2014 at 8:30 PM, Denny Lee denny.g@gmail.com wrote:

 Could you provide some context about running this in yarn-cluster mode?
 The Thrift server that's included within Spark 1.1 is based on Hive 0.12.
 Hive has been able to work against YARN since Hive 0.10.  So when you start
 the thrift server, provided you copied the hive-site.xml over to the Spark
 conf folder, it should be able to connect to the same Hive metastore and
 then execute Hive against your YARN cluster.

 On Wed, Sep 10, 2014 at 11:55 PM, vasiliy zadonsk...@gmail.com wrote:

  Hi, I have a question about the Spark SQL Thrift JDBC server.

  Is there a best practice for Spark SQL deployment? If I understand right,
  the script

  ./sbin/start-thriftserver.sh

  starts the Thrift JDBC server in local mode. Are there script options for
  running this server in yarn-cluster mode?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Executor garbage collection

2014-09-12 Thread Tim Smith
Hi,

Anyone setting any explicit GC options for the executor jvm? If yes,
what and how did you arrive at them?

Thanks,

- Tim

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.1 failure. class conflict?

2014-09-12 Thread freedafeng
The same command passed on another quick-start VM (v4.7) which has HBase 0.96
installed. Maybe there is a conflict between the newer HBase version and
Spark 1.1.0? Just my guess.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-1-failure-class-conflict-tp14127p14131.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NullWritable not serializable

2014-09-12 Thread Du Li
Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.

I got the same problem in similar code in my app, which uses the newly released
Spark 1.1.0 under Hadoop 2.4.0. Previously it worked fine with Spark 1.0.2
under either Hadoop 2.4.0 or 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable, Text]("./test_data")
rdd3.collect




Re: When does Spark switch from PROCESS_LOCAL to NODE_LOCAL or RACK_LOCAL?

2014-09-12 Thread Tsai Li Ming
Another observation I had was when reading from the local filesystem with
“file://”: it was stated as PROCESS_LOCAL, which was confusing.

Regards,
Liming

On 13 Sep, 2014, at 3:12 am, Nicholas Chammas nicholas.cham...@gmail.com 
wrote:

 Andrew,
 
 This email was pretty helpful. I feel like this stuff should be summarized in 
 the docs somewhere, or perhaps in a blog post.
 
 Do you know if it is?
 
 Nick
 
 
 On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
 The locality is how close the data is to the code that's processing it.  
 PROCESS_LOCAL means data is in the same JVM as the code that's running, so 
 it's really fast.  NODE_LOCAL might mean that the data is in HDFS on the same 
 node, or in another executor on the same node, so is a little slower because 
 the data has to travel across an IPC connection.  RACK_LOCAL is even slower 
 -- data is on a different server so needs to be sent over the network.
 
 Spark switches to lower locality levels when there's no unprocessed data on a 
 node that has idle CPUs.  In that situation you have two options: wait until 
 the busy CPUs free up so you can start another task that uses data on that 
 server, or start a new task on a farther away server that needs to bring data 
 from that remote place.  What Spark typically does is wait a bit in the hopes 
 that a busy CPU frees up.  Once that timeout expires, it starts moving the 
 data from far away to the free CPU.
 
 The main tunable option is how far long the scheduler waits before starting 
 to move data rather than code.  Those are the spark.locality.* settings here: 
 http://spark.apache.org/docs/latest/configuration.html
 
 If you want to prevent this from happening entirely, you can set the values 
 to ridiculously high numbers.  The documentation also mentions that 0 has 
 special meaning, so you can try that as well.
 
 Good luck!
 Andrew
 
 
 On Thu, Jun 5, 2014 at 3:13 PM, Sung Hwan Chung coded...@cs.stanford.edu 
 wrote:
 I noticed that sometimes tasks would switch from PROCESS_LOCAL (I'd assume 
 that this means fully cached) to NODE_LOCAL or even RACK_LOCAL.
 
 When these happen things get extremely slow.
 
 Does this mean that the executor got terminated and restarted?
 
 Is there a way to prevent this from happening (barring the machine actually 
 going down, I'd rather stick with the same process)?
 
 



workload for spark

2014-09-12 Thread 牛兆捷
We know that some of Spark's memory is used for computation (e.g., the shuffle
buffer) and some is used for caching RDDs for future use.
Is there any existing workload which utilizes both of them? I want to do a
performance study by adjusting the ratio between them.


Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
I know that unpersist is a method on RDD.
But my confusion is that, when we port our Scala programs to Spark, doesn't
everything change to RDDs?

On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
 that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You





Re: Spark and Scala

2014-09-12 Thread Hari Shreedharan
No, Scala primitives remain primitives. Unless you create an RDD using one
of the many methods for doing so, you will not be able to access any of the
RDD methods. There is no automatic porting. Spark is just an application as
far as Scala is concerned; there is no special compilation (except, of
course, the usual Scala and JIT compilation).
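
To make that concrete, a small hedged PySpark illustration (assuming an
existing SparkContext sc; the values are arbitrary):

temp = sum(range(10))                    # a plain local value: no RDD methods here
rdd = sc.parallelize(range(10)).cache()  # created through the SparkContext, so it IS an RDD
rdd.count()                              # materializes the cached data
rdd.unpersist()                          # valid, because unpersist is an RDD method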

On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
 that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
  wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run any
 Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You






Re: Spark and Scala

2014-09-12 Thread Deep Pradhan
Take this for example:
I have declared a queue, *val queue = Queue.empty[Int]*, which is a pure
Scala line in the program. I actually want the queue to be an RDD, but there
are no direct methods to create an RDD which is a queue, right? What is your
take on this?
Does there exist something like *create an RDD which is a queue*?

On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan hshreedha...@cloudera.com
 wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using one
 of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw it
 out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You







Re: NullWritable not serializable

2014-09-12 Thread Matei Zaharia
Hi Du,

I don't think NullWritable has ever been serializable, so you must be doing 
something differently from your previous program. In this case though, just use 
a map() to turn your Writables to serializable types (e.g. null and String).

Matei

On September 12, 2014 at 8:48:36 PM, Du Li (l...@yahoo-inc.com.invalid) wrote:

Hi,

I was trying the following on spark-shell (built with apache master and hadoop 
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw 
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. 

I got the same problem in similar code of my app which uses the newly released 
Spark 1.1.0 under hadoop 2.4.0. Previously it worked fine with spark 1.0.2 
under either hadoop 2.40 and 0.23.10.

Anybody knows what caused the problem?

Thanks,
Du


import org.apache.hadoop.io.{NullWritable, Text}
val rdd = sc.textFile("README.md")
val res = rdd.map(x => (NullWritable.get(), new Text(x)))
res.saveAsSequenceFile("./test_data")
val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
rdd2.collect
val rdd3 = sc.sequenceFile[NullWritable, Text]("./test_data")
rdd3.collect




Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary
abstraction in Spark.

I would strongly suggest that you have a look at the following to get a
basic idea.

http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
http://spark.apache.org/docs/latest/quick-start.html#basics
https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

On Sat, Sep 13, 2014 at 12:06 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Take for example this:
 I have declared one queue *val queue = Queue.empty[Int]*, which is a pure
 scala line in the program. I actually want the queue to be an RDD but there
 are no direct methods to create RDD which is a queue right? What say do you
 have on this?
 Does there exist something like: *Create and RDD which is a queue *?

 On Sat, Sep 13, 2014 at 8:43 AM, Hari Shreedharan 
 hshreedha...@cloudera.com wrote:

 No, Scala primitives remain primitives. Unless you create an RDD using
 one of the many methods - you would not be able to access any of the RDD
 methods. There is no automatic porting. Spark is an application as far as
 scala is concerned - there is no compilation (except of course, the scala,
 JIT compilation etc).

 On Fri, Sep 12, 2014 at 8:04 PM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 I know that unpersist is a method on RDD.
 But my confusion is that, when we port our Scala programs to Spark,
 doesn't everything change to RDDs?

 On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 unpersist is a method on RDDs. RDDs are abstractions introduced by
 Spark.

 An Int is just a Scala Int. You can't call unpersist on Int in Scala,
 and that doesn't change in Spark.

 On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 There is one thing that I am confused about.
 Spark has codes that have been implemented in Scala. Now, can we run
 any Scala code on the Spark framework? What will be the difference in the
 execution of the scala code in normal systems and on Spark?
 The reason for my question is the following:
 I had a variable
 *val temp = some operations*
 This temp was being created inside the loop, so as to manually throw
 it out of the cache, every time the loop ends I was calling
 *temp.unpersist()*, this was returning an error saying that *value
 unpersist is not a method of Int*, which means that temp is an Int.
 Can some one explain to me why I was not able to call *unpersist* on
 *temp*?

 Thank You








Re: Spark 1.1.0: Cannot load main class from JAR

2014-09-12 Thread Patrick Wendell
Hey SK,

Yeah, the documented format is the same (we expect users to add the
jar at the end) but the old spark-submit had a bug where it would
actually accept inputs that did not match the documented format. Sorry
if this was difficult to find!

- Patrick

On Fri, Sep 12, 2014 at 1:50 PM, SK skrishna...@gmail.com wrote:
 This issue is resolved. Looks like in the new spark-submit, the jar path has
 to be at the end of the options. Earlier I could specify this path in any
 order on the command line.

 thanks



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-1-0-Cannot-load-main-class-from-JAR-tp14123p14124.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift's tables unloaded via UNLOAD with the
ESCAPE option: https://github.com/mengxr/redshift-input-format , which
can recognize multi-line records.

Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
the delimiter character. You can apply the same escaping before
calling saveAsTextFile, then use the input format to load them back.

Xiangrui
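
For instance, a rough sketch of that escaping step, assuming records is an RDD of already-split fields and a pipe delimiter (both are placeholders - adjust them to your data):

// toy stand-in for the real data: an RDD[Seq[String]] of parsed fields
val records = sc.parallelize(Seq(Seq("a|b", "two\nlines"), Seq("c", "back\\slash")))
val delimiter = "|"

val escaped = records.map { fields =>
  fields.map { f =>
    f.replace("\\", "\\\\")                 // escape backslashes first
     .replace("\r", "\\\r")                 // then CR, LF and the delimiter itself
     .replace("\n", "\\\n")
     .replace(delimiter, "\\" + delimiter)
  }.mkString(delimiter)
}
escaped.saveAsTextFile("escaped_output")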

On Fri, Sep 12, 2014 at 7:43 PM, Mohit Jaggi mohitja...@gmail.com wrote:
 Folks,
 I think this might be due to the default TextInputFormat in Hadoop. Any
 pointers to solutions much appreciated.

 More powerfully, you can define your own InputFormat implementations to
 format the input to your programs however you want. For example, the default
 TextInputFormat reads lines of text files. The key it emits for each record
 is the byte offset of the line read (as a LongWritable), and the value is
 the contents of the line up to the terminating '\n' character (as a Text
 object). If you have multi-line records each separated by a $ character, you
 could write your own InputFormat that parses files into records split on
 this character instead.


 Thanks,
 Mohit
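
One way to get the record splitting described above without writing a full InputFormat is Hadoop's textinputformat.record.delimiter setting. A sketch, assuming Hadoop 2.x (whose LineRecordReader honours this key) and records that really are separated by '$'; it does not handle the general quoted-CSV case:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "$")   // records end at '$' instead of '\n'

val records = sc.newAPIHadoopFile("data.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString }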

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Looking for a good sample of Using Spark to do things Hadoop can do

2014-09-12 Thread Steve Lewis
Assume I have a large book with many chapters and many lines of text.
Assume I have a function that tells me the similarity of two lines of
text. The objective is, for each line, to find the most similar line in
the same chapter within 200 lines of it.
The real problem involves biology and is beyond this discussion.

In the code shown below I convert lines with their locations into Tuple2s
where the location is the key.

Now I want to partition by chapter (I think maybe that is right)

Now, for every chapter, I want to look at the lines in order of location.
I want to keep the last 200 locations (as LineAndLocationMatch), search
them to update the best fit, and for every line add a best fit. When a line
is more than 200 away from the current line, it can be added to the returned
JavaRDD.

I know how to do the map and generate doubles, but not how to do the sort
and reduce, or even what the reduce function arguments look like.

Please use Java functions, not lambdas, in any sample. I am a strong-typing
guy; returning JavaRDDs shows me the types for a chain of . operations and
really helps me understand what is happening.

I expect my reduceFunction to look like
 void reduceFunction(KeyClass key, Iterator<LineAndLocation> values)
but with some way to emit the best-fit LineAndLocationMatch generated as the
values are iterated. There is no reason to think that the number of objects
will fit in memory.

Also it is important for the function doing the reduce to know the key.

I am very lost as to what the reduce looks like. Under the covers, the reduce
involves a lot of Java code which knows very little about Spark and Hadoop.

My pseudo code looks like this - as far as I have it working:

// one line in the book
static class LineAndLocation {
    int chapter;
    int lineNumber;
    String line;
}

// a line of the book together with its best-fitting line
static class LineAndLocationMatch {
    LineAndLocationMatch thisLine;
    LineAndLocationMatch bestFit;
}

// location - acts as a key
static class KeyClass {
    int chapter;
    int lineNumber;

    KeyClass(final int pChapter, final int pLineNumber) {
        chapter = pChapter;
        lineNumber = pLineNumber;
    }
}

// used to compute the best fit
public class SimilarityFunction {
    double getSimilarity(String s1, String s2) {
        return 0; // todo do work here
    }
}

// This function returns an RDD with best-match objects
public static JavaRDD<LineAndLocationMatch>
        findBestMatchesLikeHadoop(JavaRDD<LineAndLocation> inputs) {

    // So this is what the mapper does - make key/value pairs
    JavaPairRDD<KeyClass, LineAndLocation> mappedKeys =
        inputs.mapToPair(new PairFunction<LineAndLocation, KeyClass, LineAndLocation>() {

            @Override public Tuple2<KeyClass, LineAndLocation>
            call(final LineAndLocation v) throws Exception {
                return new Tuple2<KeyClass, LineAndLocation>(
                    new KeyClass(v.chapter, v.lineNumber), v);
            }
        });

    // Partition by chapters ?? is this right??
    mappedKeys = mappedKeys.partitionBy(new Partitioner() {
        @Override public int numPartitions() {
            return 20;
        }

        @Override public int getPartition(final Object key) {
            return ((KeyClass) key).chapter % numPartitions();
        }
    });

    // Now I get very fuzzy - for every partition I want to sort on line number
    JavaPairRDD<KeyClass, LineAndLocation> sortedKeys = ??? WHAT HAPPENS HERE

    // Now I need to do a reduce operation. What I want is
    JavaRDD<LineAndLocationMatch> bestMatches = sortedKeys.SOME FUNCTION();

    return bestMatches;
}
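
For what it is worth, a rough Scala sketch of one possible shape for the missing sort-and-window step. It assumes a hypothetical case class standing in for LineAndLocation, a placeholder similarity function, and that each chapter's lines fit in one executor's memory (which may not hold for the real data):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

case class Line(chapter: Int, lineNumber: Int, line: String)

// placeholder scoring - swap in the real similarity measure
def similarity(a: String, b: String): Double = 0.0

def findBestMatches(inputs: RDD[Line]): RDD[(Line, Line)] = {
  // group each chapter's lines together, then order them by line number
  inputs.map(l => (l.chapter, l)).groupByKey().flatMap { case (_, lines) =>
    val sorted = lines.toArray.sortBy(_.lineNumber)
    sorted.flatMap { current =>
      // candidate partners: other lines of the same chapter within 200 line numbers
      val candidates = sorted.filter { other =>
        other.lineNumber != current.lineNumber &&
        math.abs(other.lineNumber - current.lineNumber) <= 200
      }
      if (candidates.isEmpty) None
      else Some((current, candidates.maxBy(o => similarity(current.line, o.line))))
    }
  }
}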


Re: compiling spark source code

2014-09-12 Thread qihong
Follow the instructions here:
http://spark.apache.org/docs/latest/building-with-maven.html



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14144.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org