Re: JavaRDD using Reflection

2015-09-14 Thread Ajay Singal
Hello Rachana,

The easiest way would be to start by creating a 'parent' JavaRDD and then run
different filters (based on the different input arguments) to create the
respective 'child' JavaRDDs dynamically.

Notice that the creation of these child RDDs is handled entirely by the
application driver, as in the sketch below.
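
A minimal sketch of this pattern (the class, path, and filter predicates are
illustrative, not from the original post):

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class RddFactory {
    public static Map<String, JavaRDD<String>> build(JavaSparkContext sc, String path) {
      // Parent RDD, cached so that every child filter reuses the same data
      JavaRDD<String> parent = sc.textFile(path).cache();

      // Each child is a lazily evaluated filter over the parent; all of this
      // runs in the driver, so it does not hit the SPARK-5063 restriction
      Map<String, JavaRDD<String>> children = new HashMap<>();
      children.put("errors", parent.filter(line -> line.contains("ERROR")));
      children.put("warnings", parent.filter(line -> line.contains("WARN")));
      return children;
    }
  }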

Hope this helps!
Ajay

On Mon, Sep 14, 2015 at 1:21 PM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi Rachana
>
> I didn't get your question fully, but as the error says, you cannot
> perform an RDD transformation or action inside another transformation. In
> your example you are performing the action "rdd2.values.count()" inside
> the "map" transformation. This is not allowed, and in any case it would be
> very inefficient too.
>
> You should do something like this instead (compute the count once on the
> driver, then use the plain result inside the map):
>
> final long rdd2Count = rdd2.values().count();
> rdd1.map(x -> rdd2Count * x);
>
> Hope this helps!!
>
> - Ankur
>
> On Mon, Sep 14, 2015 at 9:54 AM, Rachana Srivastava <
> rachana.srivast...@markmonitor.com> wrote:
>
>> Hello all,
>>
>>
>>
>> I am working on a problem that requires us to create different sets of
>> JavaRDDs based on different input arguments.  We are getting the following
>> error when we try to use a factory to create the JavaRDDs.  The error message
>> is clear, but I am wondering whether there is any workaround.
>>
>>
>>
>> *Question: *
>>
>> How can we dynamically create different sets of JavaRDDs based on different
>> input arguments?  We are trying to implement something like a factory pattern.
>>
>>
>>
>> *Error Message: *
>>
>> RDD transformations and actions can only be invoked by the driver, not
>> inside of other transformations; for example, rdd1.map(x =>
>> rdd2.values.count() * x) is invalid because the values transformation and
>> count action cannot be performed inside of the rdd1.map transformation. For
>> more information, see SPARK-5063.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Rachana
>>
>
>


Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Hi Tim,

An option like spark.mesos.executor.max to cap the number of executors per
node/application would be very useful.  However, having an option like
spark.mesos.executor.num to specify the desired number of executors per node
would provide even better control.

Thanks,
Ajay

On Wed, Aug 12, 2015 at 4:18 AM, Tim Chen t...@mesosphere.io wrote:

 Yes, the options are not that configurable yet, but I think it's not hard to
 change.

 I actually have a patch out specifically to configure the amount of CPUs
 per executor in coarse-grain mode, which will hopefully be merged in the next
 release.

 I think the open question now is whether, in fine-grain mode, we can limit
 the maximum number of concurrent executors; I think we can definitely just
 add a new option like spark.mesos.executor.max to cap it.

 I'll file a JIRA and hopefully get this change in soon too.

 Tim



 On Tue, Aug 11, 2015 at 6:21 AM, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Spark evolved as an example framework for Mesos - that's how I know it. It
 is surprising to see that the options provided by Mesos in this case are
 fewer. As for tweaking the source code, I haven't done it yet, but I would
 love to see what options could be there!

 On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam chiling...@gmail.com wrote:

 My experience with Mesos + Spark is not great. I saw one executor with
 30 CPUs and another executor with 6. So I don't think you can easily
 configure it without some tweaking of the source code.

 Sent from my iPad

 On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Hi Tim,

 Spark on YARN allows us to do it using the --num-executors and
 --executor-cores command-line arguments. I just got a chance to look at a
 similar Spark user list mail, but there is no answer yet. So does Mesos allow
 setting the number of executors and cores? Is there a default number it
 assumes?
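
For reference, a hedged example of those YARN-side flags at submit time (all
values and file names below are illustrative):

  spark-submit \
    --master yarn-cluster \
    --num-executors 6 \
    --executor-cores 4 \
    --executor-memory 4g \
    --class com.example.MyApp myapp.jar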

 On Mon, Jan 5, 2015 at 5:07 PM, Tim Chen t...@mesosphere.io wrote:

 Forgot to hit reply-all.

 -- Forwarded message --
 From: Tim Chen t...@mesosphere.io
 Date: Sun, Jan 4, 2015 at 10:46 PM
 Subject: Re: Controlling number of executors on Mesos vs YARN
 To: mvle m...@us.ibm.com


 Hi Mike,

 You're correct that there is no such setting for Mesos coarse-grain mode,
 since the assumption is that each node launches one container and
 Spark launches multiple tasks in that container.

 In fine-grain mode there isn't a setting like that either, as it currently
 launches an executor as long as the minimum container resource requirement
 is satisfied.

 I created a JIRA earlier about capping the number of executors, or better
 distributing the number of executors launched on each node. Since the
 decision of which node to launch containers on is entirely on the Spark
 scheduler side, it's very easy to modify.

 Btw, what's the configuration to set the # of executors on YARN side?

 Thanks,

 Tim



 On Sun, Jan 4, 2015 at 9:37 PM, mvle m...@us.ibm.com wrote:

 I'm trying to compare the performance of Spark running on Mesos vs
 YARN.
 However, I am having problems being able to configure the Spark
 workload to
 run in a similar way on Mesos and YARN.

 When running Spark on YARN, you can specify the number of executors per
 node. So if I have a node with 4 CPUs, I can specify 6 executors on
 that
 node. When running Spark on Mesos, there doesn't seem to be an
 equivalent
 way to specify this. In Mesos, you can somewhat force this by
 specifying the
 number of CPU resources to be 6 when running the slave daemon.
 However, this
 seems to be a static configuration of the Mesos cluster rather than
 something that can be configured in the Spark framework.

 So here is my question:

 For Spark on Mesos, am I correct that there is no way to control the
 number
 of executors per node (assuming an idle cluster)? For Spark on Mesos
 coarse-grained mode, there is a way to specify max_cores but that is
 still
 not equivalent to specifying the number of executors per node as when
 Spark
 is run on YARN.

 If I am correct, then it seems Spark might be at a disadvantage
 running on
 Mesos compared to YARN (since it lacks the fine tuning ability
 provided by
 YARN).

 Thanks,
 Mike



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Controlling-number-of-executors-on-Mesos-vs-YARN-tp20966.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.







 --
 Regards,
 Haripriya Ayyalasomayajula




 --
 Regards,
 Haripriya Ayyalasomayajula





Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Tim,

The ability to specify fine-grain configuration could be useful for many
reasons.  Let's take the example of a node with 32 cores.  First of all, as
per my understanding, having 5 executors each with 6 cores will almost
always perform better than having a single executor with 30 cores.  Also,
these 5 executors could be a) used by the same application, or b) shared
amongst multiple applications.  In the case of a single executor with 30
cores, some of the slots/cores could be wasted if there are fewer Tasks
(from a single application) to be executed.

As I said, applications could specify a desired number of executors.  If that
many are not available, Mesos (in a simple implementation) could offer whatever
is available.  In a slightly more complex implementation, we could build a
simple negotiation protocol.

Regards,
Ajay

On Wed, Aug 12, 2015 at 5:51 PM, Tim Chen t...@mesosphere.io wrote:

 Are you referring to both fine-grain and coarse-grain mode?

 A desired number of executors per node could be interesting, but it can't
 be guaranteed (or we could try, and abort the job when that fails).

 How would you imagine this new option actually working?


 Tim

 On Wed, Aug 12, 2015 at 11:48 AM, Ajay Singal asinga...@gmail.com wrote:

 Hi Tim,

 An option like spark.mesos.executor.max to cap the number of executors
 per node/application would be very useful.  However, having an option like
 spark.mesos.executor.num to specify the desired number of executors per node
 would provide even better control.

 Thanks,
 Ajay

 On Wed, Aug 12, 2015 at 4:18 AM, Tim Chen t...@mesosphere.io wrote:

 Yes, the options are not that configurable yet, but I think it's not hard
 to change.

 I actually have a patch out specifically to configure the amount of CPUs
 per executor in coarse-grain mode, which will hopefully be merged in the next
 release.

 I think the open question now is whether, in fine-grain mode, we can limit
 the maximum number of concurrent executors; I think we can definitely just
 add a new option like spark.mesos.executor.max to cap it.

 I'll file a JIRA and hopefully get this change in soon too.

 Tim



 On Tue, Aug 11, 2015 at 6:21 AM, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Spark evolved as an example framework for Mesos - that's how I know it.
 It is surprising to see that the options provided by Mesos in this case are
 fewer. As for tweaking the source code, I haven't done it yet, but I would
 love to see what options could be there!

 On Tue, Aug 11, 2015 at 5:42 AM, Jerry Lam chiling...@gmail.com
 wrote:

 My experience with Mesos + Spark is not great. I saw one executor with
 30 CPUs and another executor with 6. So I don't think you can easily
 configure it without some tweaking of the source code.

 Sent from my iPad

 On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula 
 aharipriy...@gmail.com wrote:

 Hi Tim,

 Spark on YARN allows us to do it using the --num-executors and
 --executor-cores command-line arguments. I just got a chance to look at a
 similar Spark user list mail, but there is no answer yet. So does Mesos allow
 setting the number of executors and cores? Is there a default number it
 assumes?

 On Mon, Jan 5, 2015 at 5:07 PM, Tim Chen t...@mesosphere.io wrote:

 Forgot to hit reply-all.

 -- Forwarded message --
 From: Tim Chen t...@mesosphere.io
 Date: Sun, Jan 4, 2015 at 10:46 PM
 Subject: Re: Controlling number of executors on Mesos vs YARN
 To: mvle m...@us.ibm.com


 Hi Mike,

 You're correct that there is no such setting for Mesos coarse-grain
 mode, since the assumption is that each node launches one container
 and Spark launches multiple tasks in that container.

 In fine-grain mode there isn't a setting like that either, as it currently
 launches an executor as long as the minimum container resource requirement
 is satisfied.

 I created a JIRA earlier about capping the number of executors, or better
 distributing the number of executors launched on each node. Since the
 decision of which node to launch containers on is entirely on the Spark
 scheduler side, it's very easy to modify.

 Btw, what's the configuration to set the # of executors on YARN side?

 Thanks,

 Tim



 On Sun, Jan 4, 2015 at 9:37 PM, mvle m...@us.ibm.com wrote:

 I'm trying to compare the performance of Spark running on Mesos vs
 YARN.
 However, I am having problems being able to configure the Spark
 workload to
 run in a similar way on Mesos and YARN.

 When running Spark on YARN, you can specify the number of executors
 per
 node. So if I have a node with 4 CPUs, I can specify 6 executors on
 that
 node. When running Spark on Mesos, there doesn't seem to be an
 equivalent
 way to specify this. In Mesos, you can somewhat force this by
 specifying the
 number of CPU resources to be 6 when running the slave daemon.
 However, this
 seems to be a static configuration of the Mesos cluster rather than
 something that can be configured in the Spark framework.

 So here is my question:

 For Spark on Mesos, am I correct

Re: How to increase parallelism of a Spark cluster?

2015-08-03 Thread Ajay Singal
Hi Sujit,



From experimenting with Spark (and other documentation), my understanding
is as follows:

1.   Each application consists of one or more Jobs

2.   Each Job has one or more Stages

3.   Each Stage creates one or more Tasks (normally, one Task per
Partition)

4.   The Master allocates one Executor per Worker (that contains a Partition)
per Application

5.   The Executor stays up for the lifetime of the Application (and
dies when the Application ends)

6.   Each Executor can run multiple Tasks in parallel (normally, the
parallelism depends on the number of cores per Executor).

7.   The Scheduler schedules only one Task from each Stage to one
Executor.

8.   If there are multiple Stages (from a Job) and these Stages could
be run asynchronously (i.e., in parallel), one Task from each Stage could
be scheduled on the same Executor (thus this Executor runs multiple Tasks
in parallel: see #6 above).



Of course, there could be many exceptions/exclusions to what I explained
above.  I expect that the Spark community will confirm or correct my
observations/understanding.



Now, let’s come back to your situation.  You have a cluster of 4 Workers
with 10 Partitions.  All of these 10 Partitions are distributed among these
4 Workers.  Also, from the information you provided, your Application
has just one Job with two Stages (repartition and mapPartitions).  The
mapPartitions Stage will have 10 Tasks.  Assuming my
observations/understanding is correct, by virtue of #7 above, only 4 Tasks
can be executed in parallel.  The remaining Tasks (and subsequent Jobs) will
have to wait.



However, if you had 10 or more Workers, all the Tasks would be executed in
parallel.  BTW, I believe you can have multiple Workers on one physical
node.  So, one solution to your problem would be to increase the number of
Workers, as sketched below.
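
As a hedged illustration (this assumes a Spark standalone deployment where you
control conf/spark-env.sh; the values are examples only), multiple Workers per
physical node can be configured like this:

  # conf/spark-env.sh on each physical node (standalone mode)
  export SPARK_WORKER_INSTANCES=2   # run two Worker daemons on this node
  export SPARK_WORKER_CORES=4       # cores each Worker offers to executors
  export SPARK_WORKER_MEMORY=8g     # memory each Worker offers to executors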



Having said that, I believe #7 above is the bottleneck.  If there is no good
reason for keeping this bottleneck, it could be a good area for improvement
(and would need to be addressed by the Spark community).  I will wait for the
community's response and, if needed, open a JIRA item.



I hope it helps.



Regards,

Ajay

On Mon, Aug 3, 2015 at 1:16 PM, Sujit Pal sujitatgt...@gmail.com wrote:

 @Silvio: the mapPartitions call instantiates an HttpSolrServer, then, for
 each query string in the partition, sends the query to Solr using SolrJ and
 gets back the top N results. It then reformats the result data into one
 long string and returns the key-value pair (query string, result string).
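
A rough sketch of that pattern (shown in Java against SolrJ 4.x for
illustration; the Solr URL, row count, and result formatting are assumptions,
not Sujit's actual code):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import scala.Tuple2;

  public class SolrLookup {
    public static JavaPairRDD<String, String> query(JavaRDD<String> queries) {
      return queries.mapPartitionsToPair(part -> {
        // One SolrJ client per partition, reused for every query in the partition
        HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1");
        List<Tuple2<String, String>> out = new ArrayList<>();
        while (part.hasNext()) {
          String q = part.next();
          QueryResponse rsp = solr.query(new SolrQuery(q).setRows(10)); // top N results
          out.add(new Tuple2<>(q, rsp.getResults().toString()));        // crude formatting
        }
        solr.shutdown();
        return out; // Spark 1.x expects an Iterable of pairs here
      });
    }
  }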

 @Igor: Thanks for the parameter suggestions. I will check --num-executors,
 and whether there is a way to set the number of cores per executor, with my
 Databricks admin and update here if I find anything; but from the
 Databricks console, it appears that the number of executors per box is 1.
 This seems normal though, per the diagram on this page:

 http://spark.apache.org/docs/latest/cluster-overview.html

 where it seems that there is 1 executor per box, and each executor can
 spawn multiple threads to take care of multiple tasks (see bullet #1 copied
 below).

 Each application gets its own executor processes, which stay up for the
 duration of the whole application and run tasks in multiple threads. This
 has the benefit of isolating applications from each other, on both the
 scheduling side (each driver schedules its own tasks) and executor side
 (tasks from different applications run in different JVMs).


 Regarding hitting the max number of requests, thanks for the link. I am
 using the default client. I just peeked at the Solr code, and the default
 (if no HttpClient instance is supplied in the constructor) is to use
 DefaultHttpClient (from HttpComponents), whose settings are as follows:


 - Version: HttpVersion.HTTP_1_1
 - ContentCharset: HTTP.DEFAULT_CONTENT_CHARSET
 - NoTcpDelay: true
 - SocketBufferSize: 8192
 - UserAgent: Apache-HttpClient/release (java 1.5)

 In addition, the Solr code sets the following additional config
 parameters on the DefaultHttpClient.

   params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 128);
   params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 32);
   params.set(HttpClientUtil.PROP_FOLLOW_REDIRECTS, followRedirects);

 Since all my connections are coming out of 2 worker boxes, it looks like I
 could get 32x2 = 64 clients hitting Solr, right?
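
If that per-host cap turns out to be the limiting factor, one option (a hedged
fragment against SolrJ 4.x; the URL and the new limits are illustrative) is to
hand HttpSolrServer a client built with higher connection limits:

  import org.apache.http.client.HttpClient;
  import org.apache.solr.client.solrj.impl.HttpClientUtil;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.params.ModifiableSolrParams;

  // Build an HttpClient with a larger connection pool and pass it to SolrJ
  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 256);
  params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 64);
  HttpClient httpClient = HttpClientUtil.createClient(params);
  HttpSolrServer solr = new HttpSolrServer("http://solr-host:8983/solr/collection1", httpClient);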

 @Steve: Thanks for the link to the HttpClient config. I was thinking about
 using a thread pool (or better, using a PoolingHttpClientConnectionManager per
 the docs), but it probably won't help since it's still being fed one request
 at a time.
 @Abhishek: my observations agree with what you said. In the past I have
 had success with repartition to reduce the partition size, especially when
 groupBy operations were involved. But I believe an executor should be able
 to handle multiple tasks in parallel, from what I understand about Akka, on
 which Spark is built - the worker is essentially an ActorSystem which 

Re: Facing problem in Oracle VM Virtual Box

2015-07-24 Thread Ajay Singal
Hi Chintan,


This is more of an Oracle VirtualBox virtualization issue than a Spark issue.



VT-x is hardware-assisted virtualization, and it is required by Oracle
VirtualBox for all 64-bit guests. The error message indicates that
either your processor does not support VT-x (but your VM is configured to
use this feature), or VT-x is available but disabled in your BIOS.



To solve this problem, you have two options:

a) enable VT-x in your machine's BIOS (so that VT-x is enabled for all the
VMs on your machine), or

b) disable VT-x for the VM in question (in your case, the Hortonworks VM), as
sketched below.
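
A hedged sketch of option (b) using the VirtualBox command line (the VM name is
taken from the error message quoted below; note that, per the above, 64-bit
guests still require VT-x, so option (a) is usually the actual fix):

  # Disable hardware virtualization (VT-x/AMD-V) for this one VM only
  VBoxManage modifyvm "Hortonworks Sandbox with HDP 2.2.4" --hwvirtex off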



Hope this helps.

Ajay

On Thu, Jul 23, 2015 at 6:40 AM, Chintan Bhatt 
chintanbhatt...@charusat.ac.in wrote:

 Hi.
 I'm facing the following error while running the .ova file containing
 Hortonworks with Spark in Oracle VM VirtualBox:

 Failed to open a session for the virtual machine *Hortonworks Sandbox
 with HDP 2.2.4*.

 VT-x features locked or unavailable in MSR.
 (VERR_VMX_MSR_LOCKED_OR_DISABLED).
 Result Code: E_FAIL (0x80004005)Component: ConsoleInterface: IConsole
 {db7ab4ca-2a3f-4183-9243-c1208da92392}

 --
 CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/
 Assistant Professor,
 U & P U Patel Department of Computer Engineering,
 Chandubhai S. Patel Institute of Technology,
 Charotar University of Science And Technology (CHARUSAT),
 Changa-388421, Gujarat, INDIA.
 http://www.charusat.ac.in
 *Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/



Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Joji,

To my knowledge, Spark does not offer any such function.

I agree, defining a function that finds an open (random) port would be a good
option.  However, in order to reach the corresponding SparkUI, one needs to
know this port number (see the sketch below).
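
A minimal sketch of that idea (shown in Java for illustration; the helper is
not part of Spark, and the same approach works from PySpark by setting
spark.ui.port on the SparkConf):

  import java.io.IOException;
  import java.net.ServerSocket;
  import org.apache.spark.SparkConf;

  public class UiPortHelper {
    // Ask the OS for a free ephemeral port and release it right away; there is
    // a small race window before Spark rebinds it, so treat this as best-effort.
    static int findFreePort() throws IOException {
      try (ServerSocket socket = new ServerSocket(0)) {
        return socket.getLocalPort();
      }
    }

    public static void main(String[] args) throws IOException {
      int uiPort = findFreePort();
      SparkConf conf = new SparkConf()
          .setAppName("my-app")
          .set("spark.ui.port", String.valueOf(uiPort));
      // The chosen port is now known up front, so it can be logged or published
      System.out.println("SparkUI will bind to port " + uiPort);
      // ... create the JavaSparkContext / submit with this conf ...
    }
  }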

Thanks,
Ajay

On Fri, Jul 24, 2015 at 10:19 AM, Joji John jj...@ebates.com wrote:

  Thanks Ajay.


  The way we wrote our Spark application is that we have a generic Python
 code, multiple instances of which can be called with different parameters.
 Does Spark offer any function to bind it to an available port?


  I guess the other option is to define a function to find an open port and
 use that.


  Thanks

 Joji John


  --
 *From:* Ajay Singal asinga...@gmail.com
 *Sent:* Friday, July 24, 2015 6:59 AM
 *To:* Joji John
 *Cc:* user@spark.apache.org
 *Subject:* Re: ERROR SparkUI: Failed to bind SparkUI
 java.net.BindException: Address already in use: Service 'SparkUI' failed
 after 16 retries!

  Hi Joji,

  I guess there is no hard limit on the number of Spark applications running
 in parallel.  However, you need to ensure that you do not use the same
 (e.g., default) port numbers for each application.

  In your specific case, for example, if you try using the default SparkUI
 port 4040 for more than one Spark application, the first application you
 start will bind to port 4040, so this port becomes unavailable (for the
 moment).  Therefore, all subsequent applications you start will get the
 SparkUI BindException.

  To solve this issue, simply use non-competing port numbers, e.g., 4040,
 4041, 4042...

  Thanks,
 Ajay

 On Fri, Jul 24, 2015 at 6:21 AM, Joji John jj...@ebates.com wrote:

  *HI,*

 *I am getting this error for some of my Spark applications. I have multiple
 Spark applications running in parallel. Is there a limit on the number of
 Spark applications that I can run in parallel?*



 *ERROR SparkUI: Failed to bind SparkUI*

 *java.net.BindException: Address already in use: Service 'SparkUI' failed
 after 16 retries!*





 *Thanks*

 *Joji john*







Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-24 Thread Ajay Singal
Hi Joji,

I guess there is no hard limit on the number of Spark applications running in
parallel.  However, you need to ensure that you do not use the same (e.g.,
default) port numbers for each application.

In your specific case, for example, if you try using the default SparkUI port
4040 for more than one Spark application, the first application you start will
bind to port 4040, so this port becomes unavailable (for the moment).
Therefore, all subsequent applications you start will get the SparkUI
BindException.

To solve this issue, simply use non-competing port numbers, e.g., 4040,
4041, 4042...
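
For illustration (the application files and port numbers below are arbitrary),
the UI port can be pinned per application at submit time, and
spark.port.maxRetries controls how many successive ports Spark tries before
giving up:

  # Give each concurrently running application its own UI port
  spark-submit --conf spark.ui.port=4041 app1.py
  spark-submit --conf spark.ui.port=4042 app2.py

  # Or let Spark retry more ports before failing (the default is 16)
  spark-submit --conf spark.port.maxRetries=32 app3.py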

Thanks,
Ajay

On Fri, Jul 24, 2015 at 6:21 AM, Joji John jj...@ebates.com wrote:

  *HI,*

 *I am getting this error for some of my Spark applications. I have multiple
 Spark applications running in parallel. Is there a limit on the number of
 Spark applications that I can run in parallel?*



 *ERROR SparkUI: Failed to bind SparkUI*

 *java.net.BindException: Address already in use: Service 'SparkUI' failed
 after 16 retries!*





 *Thanks*

 *Joji john*





Instantiating/starting Spark jobs programmatically

2015-04-20 Thread Ajay Singal
Greetings,

We have an analytics workflow system in production.  This system is built in
Java and utilizes other services (including Apache Solr).  It works fine
with a moderate level of data/processing load.  However, when the load goes
beyond a certain limit (e.g., more than 10 million messages/documents), delays
start to show up.  No doubt this is a scalability issue, and the Hadoop
ecosystem, especially Spark, can be handy in this situation.  The simplest
approach would be to rebuild the entire workflow using Spark, Kafka and
other components.  However, we decided to handle the problem in a couple of
phases.  In the first phase we identified a few pain points (areas where
performance suffers most) and have started building corresponding mini Spark
applications (so as to take advantage of parallelism).

For now my question is: how can we instantiate/start our mini Spark jobs
programmatically (e.g., from Java applications)?  The only option I see in the
documentation is to run the jobs through the command line (using spark-submit).
Any insight in this area would be highly appreciated.
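
One option worth noting here (a minimal sketch, assuming the SparkLauncher API
that ships with Spark 1.4+; the paths, class name, and master URL are
placeholders):

  import org.apache.spark.launcher.SparkLauncher;

  public class MiniJobStarter {
    public static void main(String[] args) throws Exception {
      // Programmatically spawns a spark-submit process for the packaged mini job
      Process spark = new SparkLauncher()
          .setSparkHome("/opt/spark")                  // placeholder
          .setAppResource("/jobs/mini-spark-job.jar")  // placeholder jar
          .setMainClass("com.example.MiniSparkJob")    // placeholder main class
          .setMaster("yarn-cluster")                   // or a standalone/Mesos master URL
          .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
          .launch();
      // In a real application the process output streams should be consumed
      int exitCode = spark.waitFor();
      System.out.println("Mini Spark job finished with exit code " + exitCode);
    }
  }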

In the longer term, I want to construct a collection of mini Spark applications
(each performing one specific task, similar to web services), and
architect/design bigger Spark-based applications which in turn will call
these mini Spark applications programmatically.  There is a possibility that
the Spark community has already started building such a collection of
services.  Can you please provide some information/tips/best-practices in
this regard?

Cheers!
Ajay




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Instantiating-starting-Spark-jobs-programmatically-tp22577.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
