Spark assembly for YARN/CDH5

2014-10-16 Thread Philip Ogren
Does anyone know if there are Spark assemblies created and available for 
download that have been built for CDH5 and YARN?


Thanks,
Philip

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: creating a distributed index

2014-08-04 Thread Philip Ogren
After playing around with mapPartitions I think this does exactly what I 
want.  I can pass a function to mapPartitions that looks like this:


def f1(iter: Iterator[String]): Iterator[MyIndex] = {
  val idx: MyIndex = new MyIndex()
  while (iter.hasNext) {
    val text: String = iter.next()
    idx.add(text)
  }
  List(idx).iterator
}

val indexRDD: RDD[MyIndex] = rdd.mapPartitions(f1, preservesPartitioning = true)

Now I can perform a match operation like so:

val matchRdd: RDD[Match] = indexRDD.flatMap(index => index.`match`(myquery))

I'm sure it won't take much imagination to figure out how to do the 
matching in a batch way.
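
For what it's worth, here is one possible sketch of the batch version.  It 
builds on the MyIndex, Match, and `match` names from the example above, 
assumes sc is an existing SparkContext, and assumes the query batch is small 
enough to broadcast:

// Hedged sketch: broadcast a small batch of query strings and match each one
// against every partition's index, keeping track of which query produced
// which match.
val queries: Seq[String] = Seq("first query", "second query")
val queriesBc = sc.broadcast(queries)

val batchMatches: RDD[(String, Match)] = indexRDD.flatMap { index =>
  queriesBc.value.flatMap(q => index.`match`(q).map(m => (q, m)))
}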


If anyone has done anything along these lines I'd love to have some 
feedback.


Thanks,
Philip



On 08/04/2014 09:46 AM, Philip Ogren wrote:
This looks like a really cool feature and it seems likely that this 
will be extremely useful for things we are doing.  However, I'm not 
sure it is quite what I need here.  With an inverted index you don't 
actually look items up by their keys but instead try to match against 
some input string.  So, if I created an inverted index on 1M 
strings/documents in a single JVM I could subsequently submit a 
string-valued query to retrieve the n-best matches.  I'd like to do 
something like build ten 1M string/document indexes across ten nodes, 
submit a string-valued query to each of the ten indexes and aggregate 
the n-best matches from the ten sets of results.  Would this be 
possible with IndexedRDD or some other feature of Spark?


Thanks,
Philip


On 08/01/2014 04:26 PM, Ankur Dave wrote:
At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com 
wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it? Or better yet, take a batch of strings and query each string in the
batch against each partition's index?
I proposed a key-value store based on RDDs called IndexedRDD that 
does exactly what you described. It uses mapPartitions to construct 
an index within each partition, then exposes get and multiget methods 
to allow looking up values associated with given keys.


It will hopefully make it into Spark 1.2.0. Until then you can try it 
out by merging in the pull request locally: 
https://github.com/apache/spark/pull/1297.


See JIRA for details and slides on how it works: 
https://issues.apache.org/jira/browse/SPARK-2365.


Ankur



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren
It is really nice that Spark RDDs provide functions that are often 
equivalent to functions found in Scala collections.  For example, I can 
call:


myArray.map(myFx)

and equivalently

myRdd.map(myFx)

Awesome!

My question is this.  Is it possible to write code that works on either 
an RDD or a local collection without having to have parallel 
implementations?  I can't tell that RDD or Array share any supertypes or 
traits by looking at the respective scaladocs. Perhaps implicit 
conversions could be used here.  What I would like to do is have a 
single function whose body is like this:


myData.map(myFx)

where myData could be an RDD[Array[String]] (for example) or an 
Array[Array[String]].


Has anyone had success doing this?

Thanks,
Philip




Re: relationship of RDD[Array[String]] to Array[Array[String]]

2014-07-21 Thread Philip Ogren

Thanks Michael,

That is one solution that I had thought of.  It seems like a bit of 
overkill for the few methods I want to do this for - but I will think 
about it.  I guess I was hoping that I was missing something more 
obvious/easier.


Philip

On 07/21/2014 11:20 AM, Michael Malak wrote:

It's really more of a Scala question than a Spark question, but the standard OO 
(not Scala-specific) way is to create your own custom supertype (e.g. 
MyCollectionTrait), inherited/implemented by two concrete classes (e.g. MyRDD 
and MyArray), each of which manually forwards method calls to the corresponding 
pre-existing library implementations. Writing all those forwarding method calls 
is tedious, but Scala provides at least one bit of syntactic sugar, which 
alleviates having to type in twice the parameter lists for each method:
http://stackoverflow.com/questions/8230831/is-method-parameter-forwarding-possible-in-scala

I'm not seeing a way to utilize implicit conversions in this case. Since Scala 
is statically (albeit inferred) typed, I don't see a way around having a common 
supertype.
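
For concreteness, a minimal sketch of what that might look like, using the 
illustrative names above and forwarding only map (this is just one possible 
shape, not a recommended library design):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

trait MyCollectionTrait[T] {
  def map[U: ClassTag](f: T => U): MyCollectionTrait[U]
}

class MyRDD[T](rdd: RDD[T]) extends MyCollectionTrait[T] {
  // forward to RDD.map and re-wrap the result
  def map[U: ClassTag](f: T => U): MyCollectionTrait[U] = new MyRDD(rdd.map(f))
}

class MyArray[T](array: Array[T]) extends MyCollectionTrait[T] {
  // forward to Array.map and re-wrap the result
  def map[U: ClassTag](f: T => U): MyCollectionTrait[U] = new MyArray(array.map(f))
}

Code that only needs map can then be written once against MyCollectionTrait[T].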



On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote:
It is really nice that Spark RDDs provide functions that are often
equivalent to functions found in Scala collections.  For example, I can
call:

myArray.map(myFx)

and equivalently

myRdd.map(myFx)

Awesome!

My question is this.  Is it possible to write code that works on either
an RDD or a local collection without having to have parallel
implementations?  I can't tell that RDD or Array share any supertypes or
traits by looking at the respective scaladocs. Perhaps implicit
conversions could be used here.  What I would like to do is have a
single function whose body is like this:

myData.map(myFx)

where myData could be an RDD[Array[String]] (for example) or an
Array[Array[String]].

Has anyone had success doing this?

Thanks,
Philip




Re: Announcing Spark 1.0.1

2014-07-14 Thread Philip Ogren

Hi Patrick,

This is great news, but I nearly missed the announcement because it had 
scrolled off the folder view that my Spark user list messages go to.  
There have been 40+ new threads since you sent the email out on Friday evening.


You might consider having someone on your team create a 
spark-announcement list so that it is easier to disseminate important 
information like this release announcement.


Thanks again for all your hard work.  I know you and the rest of the 
team are getting a million requests a day...


Philip


On 07/11/2014 07:35 PM, Patrick Wendell wrote:

I am happy to announce the availability of Spark 1.0.1! This release
includes contributions from 70 developers. Spark 1.0.1 includes fixes
across several areas of Spark, including the core API, PySpark, and
MLlib. It also includes new features in Spark's (alpha) SQL library,
including support for JSON data and performance and stability fixes.

Visit the release notes[1] to read about this release or download[2]
the release today.

[1] http://spark.apache.org/releases/spark-release-1-0-1.html
[2] http://spark.apache.org/downloads.html




Multiple SparkContexts with different configurations in same JVM

2014-07-10 Thread Philip Ogren
In various previous versions of Spark (and I believe the current 
version, 1.0.0, as well) we have noticed that it does not seem possible 
to have, in the same JVM, both a local SparkContext and a SparkContext 
connected to a cluster via either a Spark standalone cluster (i.e. using 
the Spark resource manager) or a YARN cluster.  Is this a known issue?  If not, then I would be happy to 
write up a bug report on what the bad/unexpected behavior is and how to 
reproduce it.  If so, then are there plans to fix this?  Perhaps there 
is a Jira issue I could be pointed to or other related discussion.


Thanks,
Philip


Re: Unit test failure: Address already in use

2014-06-18 Thread Philip Ogren
In my unit tests I have a base class that all my tests extend that has a 
setup and teardown method that they inherit.  They look something like this:


var spark: SparkContext = _

@Before
def setUp() {
  // this seems to give Spark more time to reset from the previous test's tearDown
  Thread.sleep(100L)
  spark = new SparkContext("local", "test spark")
}

@After
def tearDown() {
  spark.stop()
  spark = null // not sure why this helps but it does!
  System.clearProperty("spark.master.port")
}


It's been since last fall (i.e. version 0.8.x) since I've examined this 
code and so I can't vouch that it is still accurate/necessary - but it 
still works for me.



On 06/18/2014 12:59 PM, Lisonbee, Todd wrote:


Disabling parallelExecution has worked for me.

Other alternatives I’ve tried that also work include:

1. Using a lock – this will let tests execute in parallel except for 
those using a SparkContext.  If you have a large number of tests that 
could execute in parallel, this can shave off some time.


object TestingSparkContext {
  val lock = new Lock()
}

// before you instantiate your local SparkContext
TestingSparkContext.lock.acquire()

// after you call sc.stop()
TestingSparkContext.lock.release()

2. Sharing a local SparkContext between tests.

- This is nice because your tests will run faster.  Start-up and 
shutdown is time consuming (can add a few seconds per test).


- The downside is that your tests are using the same SparkContext so 
they are less independent of each other.  I haven’t seen issues with 
this yet but there are likely some things that might crop up.
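
A minimal sketch of option 2, assuming all suites can reference a single 
lazily created context (names are illustrative):

import org.apache.spark.SparkContext

// Hedged sketch: one SparkContext shared by every test suite.
object SharedSparkContext {
  lazy val sc: SparkContext = new SparkContext("local", "shared test context")
}

// Tests then use SharedSparkContext.sc instead of creating (and stopping)
// their own SparkContext in setUp/tearDown.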


Best,

Todd

*From:*Anselme Vignon [mailto:anselme.vig...@flaminem.com]
*Sent:* Wednesday, June 18, 2014 12:33 AM
*To:* user@spark.apache.org
*Subject:* Re: Unit test failure: Address already in use

Hi,

Could your problem come from the fact that you run your tests in 
parallel ?


If you are running Spark in local mode, you cannot have concurrent Spark 
instances running. This means that your tests instantiating a 
SparkContext cannot be run in parallel. The easiest fix is to tell sbt 
not to run parallel tests.


This can be done by adding the following line in your build.sbt:

parallelExecution in Test := false

Cheers,

Anselme

2014-06-17 23:01 GMT+02:00 SK skrishna...@gmail.com:


Hi,

I have 3 unit tests (independent of each other) in the /src/test/scala
folder. When I run each of them individually using sbt test-only test,
all 3 pass. But when I run them all using sbt test, then they fail with
the warning below. I am wondering if the binding exception results in
failure to run the job, thereby causing the failure. If so, what can I do
to address this binding exception? I am running these tests locally on a
standalone machine (i.e. SparkContext("local", "test")).


14/06/17 13:42:48 WARN component.AbstractLifeCycle: FAILED
org.eclipse.jetty.server.Server@3487b78d: java.net.BindException: Address
already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:174)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:139)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:77)


thanks



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Unit-test-failure-Address-already-in-use-tp7771.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.





Re: Processing audio/video/images

2014-06-02 Thread Philip Ogren
I asked a question related to Marcelo's answer a few months ago. The 
discussion there may be useful:


http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html


On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:

Hi Jamal,

If what you want is to process lots of files in parallel, the best
approach is probably to load all file names into an array and
parallelize that. Then each task will take a path as input and can
process it however it wants.

Or you could write the file list to a file, and then use sc.textFile()
to open it (assuming one path per line), and the rest is pretty much
the same as above.

It will probably be hard to process each individual file in parallel,
unless mp3 and jpg files can be split into multiple blocks that can be
processed separately. In that case, you'd need a custom (Hadoop) input
format that is able to calculate the splits. But it doesn't sound like
that's what you want.
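
As a rough sketch of the first approach, in Scala for illustration; the 
paths and the processFile body are hypothetical placeholders for your own 
storage layout and decoding logic:

import org.apache.spark.{SparkConf, SparkContext}

object AudioFileProcessing {
  // hypothetical per-file processing; swap in your mp3/jpeg decoding here
  def processFile(path: String): Long = path.length.toLong

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("audio-files").setMaster("local[*]"))
    val paths = Seq("/data/a.mp3", "/data/b.mp3", "/data/c.mp3")
    // parallelize the list of paths; each task handles one whole file
    val results = sc.parallelize(paths).map(processFile).collect()
    results.foreach(println)
    sc.stop()
  }
}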



On Mon, Jun 2, 2014 at 5:02 PM, jamal sasha jamalsha...@gmail.com wrote:

Hi,
How does one process data sources other than text?
Let's say I have millions of mp3 (or jpeg) files and I want to use Spark to
process them.
How does one go about it?


I have never been able to figure this out.
Let's say I have this library in Python which works like the following:

import audio

song = audio.read_mp3(filename)

Then most of the methods are attached to song or maybe there is another
function which takes song type as an input.

Maybe the above is just rambling, but how do I use Spark to process (say)
audio files?
Thanks







Re: Use SparkListener to get overall progress of an action

2014-05-23 Thread Philip Ogren

Hi Pierre,

I asked a similar question on this list about 6 weeks ago.  Here is one 
answer I got that is of particular note: 
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E


In the upcoming release of Spark 1.0 there will be a feature that 
provides for exactly what you describe: capturing the information 
displayed on the UI in JSON. More details will be provided in the 
documentation, but for now, anything before 0.9.1 can only go through 
JobLogger.scala, which outputs information in a somewhat arbitrary 
format and will be deprecated soon. If you find this feature useful, you 
can test it out by building the master branch of Spark yourself, 
following the instructions in https://github.com/apache/spark/pull/42.




On 05/22/2014 08:51 AM, Pierre B wrote:

Is there a simple way to monitor the overall progress of an action using
SparkListener or anything else?

I see that one can name an RDD... Could that be used to determine which
action triggered a stage, ... ?


Thanks

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Use-SparkListener-to-get-overall-progress-of-an-action-tp6256.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true?  I have found Spark local mode 
to be quite good about blowing up if there is something non-serializable 
and so my unit tests have been great for detecting this.  I have never 
seen something that worked in local mode that didn't work on the cluster 
because of different serialization requirements between the two.  
Perhaps it is different when using Kryo...



On 05/14/2014 04:34 AM, Andras Nemeth wrote:
E.g. if I accidentally use a closure which has something 
non-serializable in it, then my test will happily succeed in local 
mode but go down in flames on a real cluster.




Re: Opinions stratosphere

2014-05-02 Thread Philip Ogren
Great reference!  I just skimmed through the results without reading 
much of the methodology - but it looks like Spark outperforms 
Stratosphere fairly consistently in the experiments.  It's too bad the 
data sources only range from 2GB to 8GB.  Who knows if the apparent 
pattern would extend out to 64GB, 128GB, 1TB, and so on...




On 05/01/2014 06:02 PM, Christopher Nguyen wrote:
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually 
attempted such a comparative study as a Master's thesis:


http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf

According to this snapshot (c. 2013), Stratosphere is different from 
Spark in not having an explicit concept of an in-memory dataset (e.g., 
RDD).


In principle this could be argued to be an implementation detail; the 
operators and execution plan/data flow are of primary concern in the 
API, and the data representation/materializations are otherwise 
unspecified.


But in practice, for long-running interactive applications, I consider 
RDDs to be of fundamental, first-class citizen importance, and the key 
distinguishing feature of Spark's model vs other in-memory 
approaches that treat memory merely as an implicit cache.


--
Christopher T. Nguyen
Co-founder & CEO, Adatao (http://adatao.com)
linkedin.com/in/ctnguyen



On Tue, Nov 26, 2013 at 1:26 PM, Matei Zaharia 
matei.zaha...@gmail.com wrote:


I don’t know a lot about it except from the research side, where
the team has done interesting optimization stuff for these types
of applications. In terms of the engine, one thing I’m not sure of
is whether Stratosphere allows explicit caching of datasets
(similar to RDD.cache()) and interactive queries (similar to
spark-shell). But it’s definitely an interesting project to watch.

Matei

On Nov 22, 2013, at 4:17 PM, Ankur Chauhan
achau...@brightcove.com wrote:

 Hi,

 That's what I thought, but as per the slides on
http://www.stratosphere.eu they seem to know about Spark, and the
Scala API does look similar.
 I found the PACT model interesting. I would like to know if Matei
or other core committers have something to weigh in on.

 -- Ankur
 On 22 Nov 2013, at 16:05, Patrick Wendell pwend...@gmail.com wrote:

 I've never seen that project before, would be interesting to get a
 comparison. Seems to offer a much lower level API. For instance
this
 is a wordcount program:



https://github.com/stratosphere/stratosphere/blob/master/pact/pact-examples/src/main/java/eu/stratosphere/pact/example/wordcount/WordCount.java

 On Thu, Nov 21, 2013 at 3:15 PM, Ankur Chauhan
achau...@brightcove.com wrote:
 Hi,

 I was just curious about
https://github.com/stratosphere/stratosphere
 and how Spark compares to it. Does anyone have any experience
 with it to make any comments?

 -- Ankur







RDD.tail()

2014-04-14 Thread Philip Ogren
Has there been any thought to adding a tail() method to RDD?  It would 
be really handy to skip over the first item in an RDD when it contains 
header information.  Even better would be a drop(int) function that 
would allow you to skip over several lines of header information.  Our 
attempts to do something equivalent with a filter() call seem a bit 
contorted.  Any thoughts?


Thanks,
Philip
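
(For reference, one commonly suggested workaround for dropping header lines 
is a mapPartitionsWithIndex-based drop, roughly as sketched below; it assumes 
the header lines sit at the start of partition 0, as they do for a 
single-file textFile input.)

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hedged sketch: skip the first n elements of partition 0 only.
def dropLines[T: ClassTag](rdd: RDD[T], n: Int): RDD[T] =
  rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
    if (partitionIndex == 0) iter.drop(n) else iter
  }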


Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
This is great news; thanks for the update!  I will either wait for the 
1.0 release or go and test it ahead of time from git rather than trying 
to pull it out of JobLogger or creating my own SparkListener.



On 04/02/2014 06:48 PM, Andrew Or wrote:

Hi Philip,

In the upcoming release of Spark 1.0 there will be a feature that 
provides for exactly what you describe: capturing the information 
displayed on the UI in JSON. More details will be provided in the 
documentation, but for now, anything before 0.9.1 can only go through 
JobLogger.scala, which outputs information in a somewhat arbitrary 
format and will be deprecated soon. If you find this feature useful, 
you can test it out by building the master branch of Spark yourself, 
following the instructions in https://github.com/apache/spark/pull/42.


Andrew


On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote:


What I'd like is a way to capture the information provided on the
stages page (i.e. cluster:4040/stages via IndexPage).  Looking
through the Spark code, it doesn't seem like it is possible to
directly query for specific facts such as how many tasks have
succeeded or how many total tasks there are for a given active
stage.  Instead, it looks like all the data for the page is
generated at once using information from the JobProgressListener.
It doesn't seem like I have any way to programmatically access
this information myself.  I can't even instantiate my own
JobProgressListener because it is spark package private.  I could
implement my SparkListener and gather up the information myself.
 It feels a bit awkward since classes like Task and TaskInfo are
also spark package private.  It does seem possible to gather up
what I need but it seems like this sort of information should just
be available without implementing a custom SparkListener (or
worse screen scraping the html generated by StageTable!)

I was hoping that I would find the answer in MetricsServlet which
is turned on by default.  It seems that when I visit
http://cluster:4040/metrics/json/ I should be able to get
everything I want but I don't see the basic stage/task progress
information I would expect.  Are there special metrics properties
that I should set to get this info?  I think this would be the
best solution - just give it the right URL and parse the resulting
JSON - but I can't seem to figure out how to do this or if it is
possible.

Any advice is appreciated.

Thanks,
Philip



On 04/01/2014 09:43 AM, Philip Ogren wrote:

Hi DB,

Just wondering if you ever got an answer to your question
about monitoring progress - either offline or through your own
investigation.  Any findings would be appreciated.

Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show
users the current progress of map and reduce job. After
looking at the api document, I don't find anything for
this. However, in Spark UI, I could see the progress of
the task. Is there anything I miss?

Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/








Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
I can appreciate the reluctance to expose something like the 
JobProgressListener as a public interface.  It's exactly the sort of 
thing that you want to deprecate as soon as something better comes along 
and can be a real pain when trying to maintain the level of backwards 
compatibility  that we all expect from commercial grade software.  
Instead of simply marking it private and therefore unavailable to Spark 
developers, it might be worth incorporating something like a @Beta 
annotation 
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html 
which you could sprinkle liberally throughout Spark and that communicates 
"hey, use this if you want to, 'cause it's here now, but don't come crying 
if we rip it out or change it later."  This might be better than simply 
marking so many useful functions/classes as private.  I bet such an 
annotation could generate a compile warning/error for those who don't 
want to risk using them.



On 04/02/2014 06:40 PM, Patrick Wendell wrote:

Hey Phillip,

Right now there is no mechanism for this. You have to go in through 
the low level listener interface.


We could consider exposing the JobProgressListener directly - I think 
it's been factored nicely so it's fairly decoupled from the UI. The 
concern is this is a semi-internal piece of functionality and 
something we might, e.g. want to change the API of over time.


- Patrick


On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com wrote:


What I'd like is a way to capture the information provided on the
stages page (i.e. cluster:4040/stages via IndexPage).  Looking
through the Spark code, it doesn't seem like it is possible to
directly query for specific facts such as how many tasks have
succeeded or how many total tasks there are for a given active
stage.  Instead, it looks like all the data for the page is
generated at once using information from the JobProgressListener.
It doesn't seem like I have any way to programmatically access
this information myself.  I can't even instantiate my own
JobProgressListener because it is spark package private.  I could
implement my SparkListener and gather up the information myself.
 It feels a bit awkward since classes like Task and TaskInfo are
also spark package private.  It does seem possible to gather up
what I need but it seems like this sort of information should just
be available without implementing a custom SparkListener (or
worse screen scraping the html generated by StageTable!)

I was hoping that I would find the answer in MetricsServlet which
is turned on by default.  It seems that when I visit
http://cluster:4040/metrics/json/ I should be able to get
everything I want but I don't see the basic stage/task progress
information I would expect.  Are there special metrics properties
that I should set to get this info?  I think this would be the
best solution - just give it the right URL and parse the resulting
JSON - but I can't seem to figure out how to do this or if it is
possible.

Any advice is appreciated.

Thanks,
Philip



On 04/01/2014 09:43 AM, Philip Ogren wrote:

Hi DB,

Just wondering if you ever got an answer to your question
about monitoring progress - either offline or through your own
investigation.  Any findings would be appreciated.

Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show
users the current progress of map and reduce job. After
looking at the api document, I don't find anything for
this. However, in Spark UI, I could see the progress of
the task. Is there anything I miss?

Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/








Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Philip Ogren
What I'd like is a way to capture the information provided on the stages 
page (i.e. cluster:4040/stages via IndexPage).  Looking through the 
Spark code, it doesn't seem like it is possible to directly query for 
specific facts such as how many tasks have succeeded or how many total 
tasks there are for a given active stage.  Instead, it looks like all 
the data for the page is generated at once using information from the 
JobProgressListener. It doesn't seem like I have any way to 
programmatically access this information myself.  I can't even 
instantiate my own JobProgressListener because it is spark package 
private.  I could implement my SparkListener and gather up the 
information myself.  It feels a bit awkward since classes like Task and 
TaskInfo are also spark package private.  It does seem possible to 
gather up what I need but it seems like this sort of information should 
just be available without implementing a custom SparkListener (or 
worse screen scraping the html generated by StageTable!)
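
(For illustration, a rough sketch of the custom-SparkListener route, written 
against the public listener API as I understand it; the per-stage 
total-task bookkeeping needed for a real progress display is omitted.)

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hedged sketch: count finished tasks via the listener callbacks.
class ProgressListener extends SparkListener {
  val tasksCompleted = new AtomicInteger(0)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    tasksCompleted.incrementAndGet()
  }
}

// registered on an existing SparkContext with:
//   sc.addSparkListener(new ProgressListener())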


I was hoping that I would find the answer in MetricsServlet which is 
turned on by default.  It seems that when I visit 
http://cluster:4040/metrics/json/ I should be able to get everything I 
want but I don't see the basic stage/task progress information I would 
expect.  Are there special metrics properties that I should set to get 
this info?  I think this would be the best solution - just give it the 
right URL and parse the resulting JSON - but I can't seem to figure out 
how to do this or if it is possible.


Any advice is appreciated.

Thanks,
Philip


On 04/01/2014 09:43 AM, Philip Ogren wrote:

Hi DB,

Just wondering if you ever got an answer to your question about 
monitoring progress - either offline or through your own 
investigation.  Any findings would be appreciated.


Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show users the 
current progress of map and reduce job. After looking at the api 
document, I don't find anything for this. However, in Spark UI, I 
could see the progress of the task. Is there anything I miss?


Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/