Flashback: RDD.aggregate versus accumulables...

2016-03-10 Thread jiml
And Lord Joe, you were right: future versions did protect accumulators in
actions. I wonder if anyone has a "modern" take on the accumulator vs.
aggregate question. It seems that if I need to do the work by key or control
partitioning, I would use aggregate.

Bottom line question / reason for this post: does anyone have more ideas about
using aggregate instead? Am I right to think accumulables are always present on
the driver, whereas an aggregate's result has to be pulled back to the driver
explicitly?

Details: 

Both give me a way to write custom adds and merges. For example, here is a
class I am stubbing out:

import org.apache.spark.AccumulableParam;

class DropEvalAccumulableParam implements AccumulableParam<DropEvaluation, DropResult> {

    // Add additional data to the accumulator value. Allowed to modify and return
    // the first argument (the accumulated value) for efficiency, to avoid
    // allocating objects.
    @Override
    public DropEvaluation addAccumulator(DropEvaluation dropEvaluation, DropResult dropResult) {
        return null; // TODO: fold dropResult into dropEvaluation and return it
    }

    // Merge two accumulated values together. Allowed to modify and return the
    // first value for efficiency, to avoid allocating objects.
    @Override
    public DropEvaluation addInPlace(DropEvaluation masterDropEval, DropEvaluation r1) {
        return null; // TODO: merge r1 into masterDropEval and return it
    }

    // Return the "zero" (identity) value for an accumulator type, given its
    // initial value. For example, if R were a vector of N dimensions, this
    // would return a vector of N zeroes.
    @Override
    public DropEvaluation zero(DropEvaluation dropEvaluation) {
        // technically this should be the "additive identity" of a DropEvaluation
        return dropEvaluation;
    }
}
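
For comparison, here is a minimal sketch of the aggregate route. It uses a
plain JavaRDD of Integers rather than my DropEvaluation/DropResult types, so
the types and numbers are only for illustration. The custom add and merge live
in the seqOp and combOp, and the combined result only shows up on the driver
as the return value of the action:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AggregateSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("AggregateSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // zero value, then an "addAccumulator"-style seqOp and an "addInPlace"-style combOp
        Tuple2<Integer, Integer> sumAndCount = nums.aggregate(
                new Tuple2<Integer, Integer>(0, 0),
                (acc, n) -> new Tuple2<>(acc._1() + n, acc._2() + 1),      // add one element
                (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2())); // merge two partials

        // unlike an accumulable, this value never exists on the driver until here
        System.out.println("sum=" + sumAndCount._1() + " count=" + sumAndCount._2());
        sc.stop();
    }
}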








Various ways to use --jars? Some undocumented ways?

2016-01-11 Thread jiml
(Sorry to "repost" I originally answered/replied to an older question but my
part was not expanding)

Question is: Looking for all the ways to specify a set of jars using --jars
on spark-submit? I know this is old but I am about to submit a proposed docs
change on --jars, and I had an issue with --jars today. 

In an older question, the user's spark-submit command line referenced
hdfs://master:8000/srcdata/kmeans. Is that a proper way to reference a jar? Is
it a directory, or a jar that doesn't end in .jar? (I have not gotten into the
machine learning libs yet, so I don't recognize it.)

I know the docs say, "Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes."

So can this application-jar point to a directory that will be expanded, or
does it need to be a path to a single specific jar?

I ask because when I was testing --jars today, we had to explicitly provide
a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar

(The only way I figured out to use the commas was a StackOverflow answer that
led me to look beyond the docs to the command line itself; spark-submit --help
says:

  --jars JARS                 Comma-separated list of local jars to include
                              on the driver and executor classpaths.)


And it seems that we do not need to put the main jar in the --jars argument.
I have not tested yet whether other classes in the application-jar
(/usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I
need to put the application-jar in the --jars path to get classes other than
the one named by --class to be seen.
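
As a side note, here is a minimal sketch of setting the same jar list
programmatically instead of on the command line, using SparkConf.setJars. The
launcher class name is made up, the jars are the ones from my command above,
and whether this behaves identically to --jars is something I have not
verified:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PipeLinkageLauncher {
    public static void main(String[] args) {
        // setJars sets spark.jars, which is what --jars populates on spark-submit
        SparkConf conf = new SparkConf()
                .setAppName("PipeLinkageData")
                .setJars(new String[]{
                        "local:/usr/local/spark/jars/groovy-all-2.3.3.jar",
                        "local:/usr/local/spark/jars/guava-14.0.1.jar",
                        "local:/usr/local/spark/jars/jopt-simple-4.6.jar",
                        "local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar",
                        "local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar"});

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code goes here ...
        sc.stop();
    }
}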

Thanks for any ideas 






Re: how to submit multiple jar files when using spark-submit script in shell?

2016-01-11 Thread jiml
The question is: what are all the ways to specify a set of jars using --jars
on spark-submit?

I know this is old, but I am about to submit a proposed docs change on --jars,
and I had an issue with --jars today.

When this user submitted a command line referencing
hdfs://master:8000/srcdata/kmeans, was that a proper way to reference a jar?
Is it a directory, or a jar that doesn't end in .jar? (I have not gotten into
the machine learning libs yet, so I don't recognize it.)

I know the docs say, "Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes."

So can this application-jar point to a directory that will be expanded, or
does it need to be a path to a single specific jar?

I ask because when I was testing --jars today, we had to explicitly provide
a path to each jar:

/usr/local/spark/bin/spark-submit --class jpsgcs.thold.PipeLinkageData \
  --jars local:/usr/local/spark/jars/groovy-all-2.3.3.jar,local:/usr/local/spark/jars/guava-14.0.1.jar,local:/usr/local/spark/jars/jopt-simple-4.6.jar,local:/usr/local/spark/jars/jpsgcs-core-1.0.8-2.jar,local:/usr/local/spark/jars/jpsgcs-pipe-1.0.6-7.jar \
  /usr/local/spark/jars/thold-0.0.1-1.jar

(The only way I figured out to use the commas was a StackOverflow answer that
led me to look beyond the docs to the command line itself; spark-submit --help
says:

  --jars JARS                 Comma-separated list of local jars to include
                              on the driver and executor classpaths.)


And it seems that we do not need to put the main jar in the --jars argument.
I have not tested yet whether other classes in the application-jar
(/usr/local/spark/jars/thold-0.0.1-1.jar) are shipped to workers, or whether I
need to put the application-jar in the --jars path to get classes other than
the one named by --class to be seen.

Thanks for any ideas







Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread jiml
(The point of this post is to see if anyone has ideas about the errors at the end.)

In addition, the real way to test if it's working is to force serialization:

In Java:

Create an array of all your classes:

// for the Kryo serializer, register all classes that need to be serialized
Class<?>[] kryoClassArray = new Class<?>[]{DropResult.class, DropEvaluation.class,
        PrintHetSharing.class};

Then, in the builder for your SparkConf (the two .set() properties can also go
in conf/spark-defaults.conf):

SparkConf conf = new SparkConf()
        // use Kryo instead of Java serialization
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // require registration of all classes with Kryo
        .set("spark.kryo.registrationRequired", "true")
        // don't forget to register ALL classes or you will get an error
        .registerKryoClasses(kryoClassArray);

Then you will start to get neat errors like the one I am working on:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Failed to serialize task 0, not attempting to retry it.
Exception during serialization: java.io.IOException:
java.lang.IllegalArgumentException: Class is not registered:
scala.collection.mutable.WrappedArray$ofRef
Note: To register this class use:
kryo.register(scala.collection.mutable.WrappedArray$ofRef.class);

I did try adding scala.collection.mutable.WrappedArray to the Class array up
top, but no luck. Thanks.
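
One thing I have not tried yet, sketched below as an assumption rather than a
verified fix: the class named in the error is the inner class
WrappedArray$ofRef, not WrappedArray itself, and from Java it can only be
referenced by its JVM name via Class.forName. The application classes are the
same ones as above:

import org.apache.spark.SparkConf;

public class KryoRegistration {
    // Sketch only: build a SparkConf that also registers the inner class
    // named in the error message, looked up by its JVM name.
    public static SparkConf buildConf() throws ClassNotFoundException {
        Class<?>[] kryoClassArray = new Class<?>[]{
                DropResult.class,
                DropEvaluation.class,
                PrintHetSharing.class,
                Class.forName("scala.collection.mutable.WrappedArray$ofRef")};

        return new SparkConf()
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .set("spark.kryo.registrationRequired", "true")
                .registerKryoClasses(kryoClassArray);
    }
}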








Re: Spark submit does automatically upload the jar to cluster?

2015-12-29 Thread jiml
And for more clarification on this:

For non-YARN installs, a bug has been filed to make the Spark driver upload
jars.

The point of confusion that I, along with other newcomers, commonly suffer
from is this. In non-YARN installs:

The driver does NOT push your jars to the cluster. The master in the cluster
DOES push your jars to the workers. In theory.

Thanks to an email response on this list from Greg Hill for the
clarification; I hope he doesn't mind me copying the relevant part here, since
I can't link to it:

" spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers."
That's funny, I didn't delete that answer! I think I have two accounts
crossing; here was the answer:

I don't know if this is going to help, but I agree that some of the docs would
lead one to believe that the Spark driver or master is going to spread your
jars around for you. But there are other docs that seem to contradict this,
especially related to EC2 clusters.

I wrote a Stack Overflow answer dealing with a similar situation, see if it
helps:

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774

Pay attention to this section about the spark-submit docs:

I must admit, as a limitation on this, it confuses me that the Spark docs for
spark.executor.extraClassPath say:

Users typically should not need to set this option

I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it sound like the
script handles moving your code around the cluster, but I think it only moves
the classpath around for you. For example, this line from Launching
Applications with spark-submit explicitly says you have to move the jars
yourself or make them "globally available":

application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes.








Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-28 Thread jiml
Hi, a couple-three things. First, is this a Gradle project? SBT? Regardless of
the answer, convince yourself that you are getting this error from the command
line before doing anything else. Eclipse is awesome, and it's also really
glitchy; I have seen too many cases recently where something funky happens in
Eclipse but I can go to the shell and "gradle build" or "gradle run" just
fine.

With that out of the way, and I don't know yet how generally applicable this
idea is: get rid of ALL hostnames and try with just IP addresses. I posted the
results of some research I did this morning on SO:

http://stackoverflow.com/questions/28453835/apache-sparck-error-could-not-connect-to-akka-tcp-sparkmaster/34499020#34499020

Note that what I focus on is getting all spurious config out of the way.
Comment out every setting in spark-defaults.conf and spark-env.sh that refers
to IPs or the master, and do only this: on the master, in spark-env.sh, set
SPARK_MASTER_IP to the IP address, not the hostname. Then use IP addresses in
your call to the SparkContext. See what happens.
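
A minimal sketch of what I mean, in Java rather than your Scala: on the
master, conf/spark-env.sh gets SPARK_MASTER_IP=10.20.17.70 (the IP from your
log), and the client builds its context against that same IP. The jar path is
just the placeholder from your own code below:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class IpOnlySmokeTest {
    public static void main(String[] args) {
        // Use the raw IP everywhere; it must match what the master actually bound to.
        // (Your master log shows it bound to akka.tcp://sparkMaster@hadoop00:7077,
        // which is why the IP-addressed connection is being dropped.)
        SparkConf conf = new SparkConf()
                .setAppName("IP-only smoke test")
                .setMaster("spark://10.20.17.70:7077")
                .setJars(new String[]{"C:\\Temp\\test.jar"});

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println(sc.parallelize(Arrays.asList(1, 2, 3)).count());
        sc.stop();
    }
}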

I know what you are seeing is two different bits of code working differently,
but I would bet it's an underlying Spark config issue. The important part is
the master log, which clearly identifies a network problem. As noted in my SO
post, there's a bug out there that leads me to always use IP addresses, but I
am not sure how widely applicable that answer is :)

If that doesn't work, please post what the difference is between the
"WordCount MapReduce job" and the "Spark WordCount"; that's not clear to me.
Post your SparkConf and SparkContext calls.

JimL


   I'm new to Spark. Before I describe the problem, I'd like to let you know
the role of the machines that make up the cluster and the purpose of my work.
By reading and following the instructions and tutorials, I successfully built
a cluster of 7 CentOS 6.5 machines. I installed Hadoop 2.7.1, Spark 1.5.1,
Scala 2.10.4 and ZooKeeper 3.4.5 on them. The details are listed below:


 As all the other guys in our group are in the habit of using Eclipse on
Windows, I'm trying to work that way. I have successfully submitted the
WordCount MapReduce job to YARN, and it ran smoothly through Eclipse on
Windows. But when I tried to run the Spark WordCount, it gave me the following
error in the Eclipse console:

...

15/12/23 11:15:33 ERROR ErrorMonitor: dropping message [class
akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://sparkMaster@10.20.17.70:7077/]] arriving at
[akka.tcp://sparkMaster@10.20.17.70:7077] inbound addresses are
[akka.tcp://sparkMaster@hadoop00:7077]
akka.event.Logging$Error$NoCause$
15/12/23 11:15:53 INFO Master: 10.20.6.23:56374 got disassociated, removing
it.
15/12/23 11:15:53 INFO Master: 10.20.6.23:56374 got disassociated, removing
it.
15/12/23 11:15:53 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkDriver@10.20.6.23:56374] has failed, address is now
gated for [5000] ms. Reason: [Disassociated] 
...

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Scala WordCount")
      .setMaster("spark://10.20.17.70:7077")
      .setJars(List("C:\\Temp\\test.jar"))
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("hdfs://10.20.17.70:9000/wc/indata/wht.txt")
    textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
  }
}

 






Re: Spark submit does automatically upload the jar to cluster?

2015-12-28 Thread jiml
That's funny, I didn't delete that answer! I think I have two accounts
crossing; here was the answer:

I don't know if this is going to help, but I agree that some of the docs would
lead one to believe that the Spark driver or master is going to spread your
jars around for you. But there are other docs that seem to contradict this,
especially related to EC2 clusters.

I wrote a Stack Overflow answer dealing with a similar situation, see if it
helps:

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774

Pay attention to this section about the spark-submit docs:

I must admit, as a limitation on this, it confuses me that the Spark docs for
spark.executor.extraClassPath say:

Users typically should not need to set this option

I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it sound like the
script handles moving your code around the cluster, but I think it only moves
the classpath around for you. For example, this line from Launching
Applications with spark-submit explicitly says you have to move the jars
yourself or make them "globally available":

application-jar: Path to a bundled jar including your application and
all dependencies. The URL must be globally visible inside of your cluster,
for instance, an hdfs:// path or a file:// path that is present on all
nodes.












Re: SPARK_CLASSPATH out, spark.executor.extraClassPath in?

2015-12-28 Thread jiml
I looked into this a lot more and posted an answer to a similar question on
SO, but it's EC2-specific. There still might be some useful info in there, and
any comments/corrections/improvements would be greatly appreciated!

http://stackoverflow.com/questions/23687081/spark-workers-unable-to-find-jar-on-ec2-cluster/34502774#34502774
(answer from today, by me)


