Re: Enforcing shuffle hash join

2016-07-04 Thread Sun Rui
You can try setting the “spark.sql.join.preferSortMergeJoin” option to false.

For detailed join strategies, take a look at the source code of 
SparkStrategies.scala:

/**
 * Select the proper physical plan for join based on joining keys and size of
 * logical plan.
 *
 * At first, uses the [[ExtractEquiJoinKeys]] pattern to find joins where at
 * least some of the predicates can be evaluated by matching join keys. If
 * found, Join implementations are chosen with the following precedence:
 *
 * - Broadcast: if one side of the join has an estimated physical size that is
 *   smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]]
 *   threshold or if that side has an explicit broadcast hint (e.g. the user
 *   applied the [[org.apache.spark.sql.functions.broadcast()]] function to a
 *   DataFrame), then that side of the join will be broadcasted and the other
 *   side will be streamed, with no shuffling performed. If both sides of the
 *   join are eligible to be broadcasted then the
 * - Shuffle hash join: if the average size of a single partition is small
 *   enough to build a hash table.
 * - Sort merge: if the matching join keys are sortable.
 *
 * If there is no joining keys, Join implementations are chosen with the
 * following precedence:
 * - BroadcastNestedLoopJoin: if one side of the join could be broadcasted
 * - CartesianProduct: for Inner join
 * - BroadcastNestedLoopJoin
 */
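Putting the two settings from this thread together, here is a minimal sketch of steering the planner toward a shuffle hash join. This is hedged: it assumes a Spark 2.0-era SparkSession API, and the app name and toy DataFrames are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object ForceShuffleHashJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("force-shuffle-hash-join") // hypothetical app name
      // -1 disables automatic broadcast joins entirely
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      // do not prefer sort-merge join over shuffle hash join
      .config("spark.sql.join.preferSortMergeJoin", "false")
      .getOrCreate()
    import spark.implicits._

    val a = Seq((1, "x"), (2, "y")).toDF("id", "v1")
    val b = Seq((1, "p"), (2, "q")).toDF("id", "v2")
    // With the settings above, the physical plan should show ShuffledHashJoin
    // rather than SortMergeJoin, provided one side is small enough to build a
    // per-partition hash table.
    a.join(b, "id").explain()
    spark.stop()
  }
}
```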


> On Jul 5, 2016, at 13:28, Lalitha MV wrote:
> 
> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set to 
> -1, or when the size of the small table is more than 
> spark.sql.autoBroadcastJoinThreshold.
> 
> On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro wrote:
> The join selection can be described in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92.
> If you have join keys, you can set -1 at 
> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then, hash 
> joins are used in queries.
> 
> // maropu 
> 
> On Tue, Jul 5, 2016 at 4:23 AM, Lalitha MV wrote:
> Hi maropu, 
> 
> Thanks for your reply. 
> 
> Would it be possible to write a rule for this, to make it always pick shuffle 
> hash join over other join implementations (i.e. sort merge and broadcast)? 
> 
> Is there any documentation demonstrating rule based transformation for 
> physical plan trees? 
> 
> Thanks,
> Lalitha
> 
> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro wrote:
> Hi,
> 
> No, spark has no hint for the hash join.
> 
> // maropu
> 
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV wrote:
> Hi, 
> 
> In order to force broadcast hash join, we can set the 
> spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce 
> shuffle hash join in spark sql? 
> 
> 
> Thanks,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro
> 
> 
> 
> -- 
> Regards,
> Lalitha



Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
What's the query?

On Tue, Jul 5, 2016 at 2:28 PM, Lalitha MV  wrote:

> It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is
> set to -1, or when the size of the small table is more than
> spark.sql.autoBroadcastJoinThreshold.



-- 
---
Takeshi Yamamuro


Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
It picks sort merge join when spark.sql.autoBroadcastJoinThreshold is set
to -1, or when the size of the small table is more than
spark.sql.autoBroadcastJoinThreshold.

On Mon, Jul 4, 2016 at 10:17 PM, Takeshi Yamamuro 
wrote:

> The join selection can be described in
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
> .
> If you have join keys, you can set -1 at
> `spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then,
> hash joins are used in queries.
>
> // maropu



-- 
Regards,
Lalitha


Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
The join selection can be described in
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
.
If you have join keys, you can set -1 at
`spark.sql.autoBroadcastJoinThreshold` to disable broadcast joins. Then,
hash joins are used in queries.

// maropu




-- 
---
Takeshi Yamamuro


Spark Dataframe validating column names

2016-07-04 Thread Scott W
Hello,

I'm processing events using Dataframes converted from a stream of JSON
events (Spark Streaming) which eventually get written out in Parquet
format. There are different JSON events coming in, so we use the schema
inference feature of Spark SQL.

The problem is that some of the JSON events contain spaces in the keys, and I
want to log and filter/drop such events from the data frame before
converting it to Parquet, because ,;{}()\n\t= are considered special
characters in a Parquet schema (CatalystSchemaConverter), as listed in [1]
below, and thus should not be allowed in the column names.

How can I do such validations in Dataframe on the column names and drop
such an event altogether without erroring out the Spark Streaming job?

[1] Spark's CatalystSchemaConverter

  def checkFieldName(name: String): Unit = {
    // ,;{}()\n\t= and space are special characters in Parquet schema
    checkConversionRequirement(
      !name.matches(".*[ ,;{}()\n\t=].*"),
      s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
         |Please use alias to rename it.
       """.stripMargin.split("\n").mkString(" ").trim)
  }
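The same character rule can be mirrored outside Spark to pre-validate column names before attempting the Parquet write. A standalone sketch follows; the object and helper names are hypothetical, and only the regex comes from CatalystSchemaConverter:

```scala
// Hypothetical helper mirroring CatalystSchemaConverter's field-name rule,
// so invalid column names can be detected (and logged/dropped) up front.
object ParquetFieldNames {
  // Same character class that CatalystSchemaConverter rejects
  private val InvalidChars = ".*[ ,;{}()\n\t=].*"

  def isValid(name: String): Boolean = !name.matches(InvalidChars)

  // Partition a schema's column names into (valid, invalid)
  def split(columns: Seq[String]): (Seq[String], Seq[String]) =
    columns.partition(isValid)
}
```

With a DataFrame, one could then keep only the surviving names, e.g. df.select(valid.head, valid.tail: _*), while logging the rejected ones, instead of letting the write fail.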


Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when
I have time.

2016-07-04 14:09 GMT-07:00 mshiryae :

> Hi,
>
> I am trying to train model by MultilayerPerceptronClassifier.
>
> It works on sample data from
> data/mllib/sample_multiclass_classification_data.txt with 4 features, 3
> classes and layers [4, 4, 3].
> But when I try to use other input files with other features and classes
> (from here for example:
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html)
> then I get errors.
>
> Example:
> Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]):
>
>
> with block size = 1:
> ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation.
> Decreasing step size to Infinity
> ERROR LBFGS: Failure! Resetting history:
> breeze.optimize.FirstOrderException: Line search failed
> ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is
> just poorly behaved?
>
>
> with default block size = 128:
>  java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
>   at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
>at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)
>
>
>
> Even if I modify the sample_multiclass_classification_data.txt file (rename all
> 4th features to 5th) and run with layers [5, 5, 3], then I also get the
> same errors as for the file above.
>
>
> So to summarize:
> I can't run training with the default block size and with more than 4 features.
> If I set block size to 1 then some actions happen, but I get errors
> from LBFGS.
> It is reproducible with Spark 1.5.2 and from the master branch on github (from
> 4th July).
>
> Has somebody already met such behavior?
> Is there a bug in MultilayerPerceptronClassifier or am I using it incorrectly?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-MultilayerPerceptronClassifier-error-tp27279.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to handle update/deletion in Structured Streaming?

2016-07-04 Thread Tathagata Das
Input datasets which represent an input data stream only support appending
of new rows, as the stream is modeled as an unbounded table where new data
in the stream corresponds to new rows being appended to the table. For
transformed datasets generated from the input dataset, rows can be updated
and removed as the input dataset has rows added. To take a concrete example,
if you are maintaining a running word count dataset, every time the input
dataset has new rows, the rows of the word count dataset will get updated.
Where this really matters is when the transformed data needs to be written
to an output sink, and that's where the output modes decide how the
updated/deleted rows are written to the sink. Currently, Spark 2.0 will
support only the Complete Mode, where after any update, ALL the rows (i.e.
added, updated, and unchanged rows) of the word count dataset will be given
to the sink for output. A future version of Spark will have the Update mode,
where only the added/updated rows will be given to the sink.
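The running word count example above can be sketched as follows. Hedged: this assumes the Spark 2.0-era Structured Streaming API, and the socket source host/port are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-wordcount").getOrCreate()
    import spark.implicits._

    // Unbounded input table: each incoming line is an appended row
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost") // made-up source
      .option("port", 9999)
      .load()

    // Transformed dataset whose rows get *updated* as input rows arrive
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Complete mode: the entire updated result table (added, updated and
    // unchanged rows) is handed to the sink after every trigger.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```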

On Mon, Jul 4, 2016 at 8:23 AM, Arnaud Bailly 
wrote:

> Hello,
>
> I am interested in using the new Structured Streaming feature of Spark SQL
> and am currently doing some experiments on code at HEAD. I would like to
> have a better understanding of how deletion should be handled in a
> structured streaming setting.
>
> Given some incremental query computing an arbitrary aggregation over some
> dataset, inserting new values is somewhat obvious: simply update the
> aggregate computation tree with whatever new values are added to input
> datasets/datastreams. But things are not so obvious for updates and
> deletions: do they have a representation in the input datastreams? If I
> have a query that aggregates some value over some key, and I delete all
> instances of that key, I would expect the query to output a result removing
> the key's aggregated value. The same is true for updates...
>
> Thanks for any insights you might want to share.
>
> Regards,
> --
> Arnaud Bailly
>
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>


Pregel algorithm edge direction docs

2016-07-04 Thread svjk24

Hi,
  I'm looking through the Pregel algorithm Scaladocs and, at least in 1.5.1, 
there seems to be some contradiction between the specifications for 
activeDirection and sendMsg:


activeDirection

   the direction of edges incident to a vertex that received a message
   in the previous round on which to run sendMsg. For example, if this
   is EdgeDirection.Out, only out-edges of vertices that received a
   message in the previous round will run.



sendMsg

   a user supplied function that is applied to out edges of vertices
   that received messages in the current iteration


As I understand it, activeDirection specifies which edges 
incident to a given vertex will undergo the sendMsg operation in the 
next iteration, so it would not be correct that sendMsg is "a user 
supplied function that is applied to *out* edges of vertices", but 
rather "a user supplied function that is applied to activeDirection 
edges of vertices"? Am I missing something conceptually or is this a 
documentation error?
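To make the interplay concrete, here is a hedged GraphX sketch (1.5-era API, object name made up) of single-source shortest paths, a standard Pregel example: with activeDirection = EdgeDirection.Out, sendMsg is only invoked on out-edges of vertices that received a message in the previous round.

```scala
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}

object PregelSketch {
  def shortestPaths(graph: Graph[Long, Double],
                    source: VertexId): Graph[Double, Double] = {
    // Initialize distances: 0 at the source, infinity elsewhere
    val init = graph.mapVertices((id, _) =>
      if (id == source) 0.0 else Double.PositiveInfinity)

    init.pregel(Double.PositiveInfinity,
                activeDirection = EdgeDirection.Out)(
      // vprog: keep the smaller of the current and incoming distance
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: runs only on the edges selected by activeDirection,
      // i.e. out-edges of vertices messaged in the previous round
      (triplet: EdgeTriplet[Double, Double]) =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      // mergeMsg: combine messages arriving at the same vertex
      (a: Double, b: Double) => math.min(a, b)
    )
  }
}
```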




Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
Found some warn and error messages in driver log file:
2016-07-04 04:49:50,106 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-api-jdo-4.2.1.jar."
2016-07-04 04:49:50,115 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-core-4.1.6.jar."
2016-07-04 04:49:50,136 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-rdbms-4.1.7.jar."
2016-07-04 04:49:57,387 [main] WARN  org.apache.hadoop.hive.metastore.ObjectStore- Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2016-07-04 04:49:57,563 [main] WARN  org.apache.hadoop.hive.metastore.ObjectStore- Failed to get database default, returning NoSuchObjectException
2016-07-04 04:50:32,046 [main] WARN  org.apache.spark.SparkConf- The configuration key 'spark.kryoserializer.buffer.max.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer.max' instead.
2016-07-04 04:50:32,048 [main] WARN  org.apache.spark.SparkConf- The configuration key 'spark.kryoserializer.buffer.max.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer.max' instead.
2016-07-04 04:50:32,170 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-api-jdo-4.2.1.jar."
2016-07-04 04:50:32,173 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-core-4.1.6.jar."
2016-07-04 04:50:32,183 [main] WARN  DataNucleus.General- Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/mapr/spark/spark-1.6.1/lib/datanucleus-rdbms-4.1.7.jar."
2016-07-04 04:50:35,678 [main] WARN  org.apache.hadoop.hive.metastore.ObjectStore- Failed to get database default, returning NoSuchObjectException
2016-07-04 05:46:50,052 [main] WARN  org.apache.spark.SparkConf- The configuration key 'spark.kryoserializer.buffer.max.mb' has been deprecated as of Spark 1.4 and may be removed in the future. Please use the new key 'spark.kryoserializer.buffer.max' instead.
2016-07-04 05:50:09,944 [Thread-27-EventThread] WARN  com.mapr.util.zookeeper.ZKDataRetrieval- ZK Reset due to SessionExpiration for ZK: hostname:5181,hostname:5181,hostname:5181
2016-07-04 05:11:53,972 [dispatcher-event-loop-0] ERROR org.apache.spark.scheduler.LiveListenerBus- Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.


On Tue, Jul 5, 2016 at 4:45 AM, kishore kumar wrote:

Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
Find the log from rm below, before FNFE there is no earlier errors in
driver log,

16/07/04 00:27:56 INFO mapreduce.TableInputFormatBase: Input split length:
0 bytes.
16/07/04 00:27:56 INFO executor.Executor: Executor is trying to kill task
56.0 in stage 2437.0 (TID 328047)
16/07/04 00:27:56 INFO executor.Executor: Executor killed task 266.0 in
stage 2433.0 (TID 328005)
16/07/04 00:27:56 INFO executor.Executor: Executor killed task 206.0 in
stage 2433.0 (TID 327977)
16/07/04 00:27:56 INFO executor.Executor: Executor killed task 318.0 in
stage 2433.0 (TID 328006)
16/07/04 00:27:57 INFO executor.Executor: Executor killed task 56.0 in
stage 2437.0 (TID 328047)
16/07/04 00:27:57 INFO executor.CoarseGrainedExecutorBackend: Driver
commanded a shutdown
16/07/04 00:27:57 INFO storage.MemoryStore: MemoryStore cleared
16/07/04 00:27:57 INFO storage.BlockManager: BlockManager stopped
16/07/04 00:27:57 WARN executor.CoarseGrainedExecutorBackend: An unknown (
driver.domain.com:56055) driver disconnected.
16/07/04 00:27:57 ERROR executor.CoarseGrainedExecutorBackend: Driver
xx:xx:xx:xx:56055 disassociated! Shutting down.
16/07/04 00:27:57 INFO util.ShutdownHookManager: Shutdown hook called
16/07/04 00:27:57 INFO util.ShutdownHookManager: Deleting directory
/opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/user/appcache/application_1467474162580_29353/spark-9c0bfccc-74c3-4541-a2fd-19101e47b49a
End of LogType:stderr


On Mon, Jul 4, 2016 at 4:21 PM, Jacek Laskowski  wrote:

> Can you share some stats from Web UI just before the failure? Any earlier
> errors before FNFE?
>
> Jacek
> On 4 Jul 2016 12:34 p.m., "kishore kumar"  wrote:
>
>> @jacek: It is running on yarn-client mode, our code don't support running
>> in yarn-cluster mode and the job is running for around an hour and giving
>> the exception.
>>
>> @karhi: yarn application status is successful, resourcemanager logs did
>> not give any failure info except
>> 16/07/04 00:27:57 INFO executor.CoarseGrainedExecutorBackend: Driver
>> commanded a shutdown
>> 16/07/04 00:27:57 INFO storage.MemoryStore: MemoryStore cleared
>> 16/07/04 00:27:57 INFO storage.BlockManager: BlockManager stopped
>> 16/07/04 00:27:57 WARN executor.CoarseGrainedExecutorBackend: An unknown (
>> slave1.domain.com:56055) driver disconnected.
>> 16/07/04 00:27:57 ERROR executor.CoarseGrainedExecutorBackend: Driver
>> 173.36.88.26:56055 disassociated! Shutting down.
>> 16/07/04 00:27:57 INFO util.ShutdownHookManager: Shutdown hook called
>> 16/07/04 00:27:57 INFO util.ShutdownHookManager: Deleting directory
>> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/user/appcache/application_1467474162580_29353/spark-9c0bfccc-74c3-4541-a2fd-19101e47b49a
>> End of LogType:stderr
>>
>>
>> On Mon, Jul 4, 2016 at 3:20 PM, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> You seem to be using yarn. Is this cluster or client deploy mode? Have
>>> you seen any other exceptions before? How long did the application run
>>> before the exception?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Mon, Jul 4, 2016 at 10:57 AM, kishore kumar 
>>> wrote:
>>> > We've upgraded spark version from 1.2 to 1.6 still the same problem,
>>> >
>>> > Exception in thread "main" org.apache.spark.SparkException: Job aborted due
>>> > to stage failure: Task 286 in stage 2397.0 failed 4 times, most recent
>>> > failure: Lost task 286.3 in stage 2397.0 (TID 314416, salve-06.domain.com):
>>> > java.io.FileNotFoundException:
>>> > /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.index.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)
>>> >
>>> > Kindly help me to get rid from this.
>>> >
>>> > On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar 
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> Could anyone help me about this error ? why this error comes ?
>>> >>
>>> >> Thanks,
>>> >> KishoreKuamr.
>>> >>
>>> >> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
>>> >> wrote:
>>> >>>
>>> >>> Hi Jeff Zhang,
>>> >>>
>>> >>> Thanks for response, could you explain me why this error occurs ?
>>> >>>
>>> >>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:
>>> 
>>>  One quick solution is to use spark 1.6.1.
>>> 
>>>  On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar <
>>> akishore...@gmail.com>
>>>  wrote:
>>> >
>>> > Could anyone help me on this issue ?
>>> >
>>> > On Tue, May 31, 2016 at 8:00 PM, kishore kumar <
>>> akishore...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> We installed spark1.2.1 in single node, 

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hi Luis,

Right...

I manage all my Spark "things" through Maven, by that I mean I have a pom.xml 
with all the dependencies in it. Here it is:

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>app</groupId>
  <artifactId>main</artifactId>
  <version>0.0.1</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.6</version>
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-core</artifactId>
      <version>5.2.0.Final</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.6.2</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>com.databricks</groupId>
      <artifactId>spark-csv_2.10</artifactId>
      <version>1.4.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.4</version>
    </dependency>
    <dependency>
      <groupId>joda-time</groupId>
      <artifactId>joda-time</artifactId>
      <version>2.9.4</version>
    </dependency>
  </dependencies>
</project>

When I run the application, I run it not through Maven but through Eclipse as 
a run configuration.

At no point do I see or set a SPARK_HOME. I tried programmatically as well and 
Spark does not get it.

I do not connect to a Spark cluster (yet), just on my machine...

I hope it is clear, just started spark'ing recently...

jg




> On Jul 4, 2016, at 6:28 PM, Luis Mateos wrote:
> 
> Hi Jean, 
> 
> What do you mean by "running everything through maven"? Usually, applications 
> are compiled using maven and then launched by using the 
> $SPARK_HOME/bin/spark-submit script. It might be helpful to provide us more 
> details on how you are running your application.
> 
> Regards,
> Luis
> 
> On 4 July 2016 at 16:57, Jean Georges Perrin wrote:
> Hey Anupam,
> 
> Thanks... but no:
> 
> I tried:
> 
>   SparkConf conf = new SparkConf().setAppName("my 
> app").setMaster("local");
>   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>   javaSparkContext.setLogLevel("WARN");
>   SQLContext sqlContext = new SQLContext(javaSparkContext);
> 
> and
> 
>   SparkConf conf = new SparkConf().setAppName("my 
> app").setMaster("local");
>   SparkContext sc = new SparkContext(conf);
>   sc.setLogLevel("WARN");
>   SQLContext sqlContext = new SQLContext(sc);
> 
> and they are still very upset at my console :)...
> 
> 
>> On Jul 4, 2016, at 5:28 PM, Anupam Bhatnagar wrote:
>> 
>> Hi Jean,
>> 
>> How about using sc.setLogLevel("WARN") ? You may add this statement after 
>> initializing the Spark Context. 
>> 
>> From the Spark API - "Valid log levels include: ALL, DEBUG, ERROR, FATAL, 
>> INFO, OFF, TRACE, WARN". Here's the link in the Spark API. 
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
>>  
>> 
>> 
>> Hope this helps,
>> Anupam
>>   
>> 
>> 
>> 
>> On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin wrote:
>> Thanks Mich, but what is SPARK_HOME when you run everything through Maven?
>> 
>>> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh wrote:
>>> 
>>> check %SPARK_HOME/conf
>>> 
>>> copy file log4j.properties.template to log4j.properties
>>> 
>>> edit log4j.properties and set the log levels to your needs
>>> 
>>> cat log4j.properties
>>> 
>>> # Set everything to be logged to the console
>>> log4j.rootCategory=ERROR, console
>>> log4j.appender.console=org.apache.log4j.ConsoleAppender
>>> log4j.appender.console.target=System.err
>>> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>>> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
>>> %c{1}: %m%n
>>> # Settings to quiet third party logs that are too verbose
>>> log4j.logger.org.spark-project.jetty=WARN
>>> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
>>> 

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hey Anupam,

Thanks... but no:

I tried:

SparkConf conf = new SparkConf().setAppName("my 
app").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
javaSparkContext.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(javaSparkContext);

and

SparkConf conf = new SparkConf().setAppName("my 
app").setMaster("local");
SparkContext sc = new SparkContext(conf);
sc.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(sc);

and they are still very upset at my console :)...




Re: log traces

2016-07-04 Thread Anupam Bhatnagar
Hi Jean,

How about using sc.setLogLevel("WARN") ? You may add this statement after
initializing the Spark Context.

From the Spark API - "Valid log levels include: ALL, DEBUG, ERROR, FATAL,
INFO, OFF, TRACE, WARN". Here's the link in the Spark API.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext

Hope this helps,
Anupam




On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin  wrote:

> Thanks Mich, but what is SPARK_HOME when you run everything through Maven?
>
> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh 
> wrote:
>
> check %SPARK_HOME/conf
>
> copy file log4j.properties.template to log4j.properties
>
> edit log4j.properties and set the log levels to your needs
>
> cat log4j.properties
>
> # Set everything to be logged to the console
> log4j.rootCategory=ERROR, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
> %c{1}: %m%n
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up
> nonexistent UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 July 2016 at 21:56, Jean Georges Perrin  wrote:
>
>> Hi,
>>
>> I have installed Apache Spark via Maven.
>>
>> How can I control the volume of log it displays on my system? I tried
>> different location for a log4j.properties, but none seems to work for me.
>>
>> Thanks for help...
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>


Re: log traces

2016-07-04 Thread Jean Georges Perrin
Thanks Mich, but what is SPARK_HOME when you run everything through Maven?
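When Spark is pulled in purely as a Maven dependency there is no SPARK_HOME conf directory to edit; log4j instead picks up a log4j.properties found on the application classpath. A minimal sketch, assuming the standard Maven layout (so the file goes in src/main/resources), reusing the settings quoted below:

```properties
# src/main/resources/log4j.properties
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```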

> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh  wrote:
> 
> check %SPARK_HOME/conf
> 
> copy file log4j.properties.template to log4j.properties
> 
> edit log4j.properties and set the log levels to your needs
> 
> cat log4j.properties
> 
> # Set everything to be logged to the console
> log4j.rootCategory=ERROR, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 4 July 2016 at 21:56, Jean Georges Perrin wrote:
> Hi,
> 
> I have installed Apache Spark via Maven.
> 
> How can I control the volume of log it displays on my system? I tried 
> different location for a log4j.properties, but none seems to work for me.
> 
> Thanks for help...
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 



Re: log traces

2016-07-04 Thread Mich Talebzadeh
check %SPARK_HOME/conf

copy file log4j.properties.template to log4j.properties

edit log4j.properties and set the log levels to your needs

cat log4j.properties

# Set everything to be logged to the console
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
%c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up
nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 July 2016 at 21:56, Jean Georges Perrin  wrote:

> Hi,
>
> I have installed Apache Spark via Maven.
>
> How can I control the volume of log it displays on my system? I tried
> different location for a log4j.properties, but none seems to work for me.
>
> Thanks for help...
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread mshiryae
Hi,

I am trying to train a model with MultilayerPerceptronClassifier.

It works on sample data from
data/mllib/sample_multiclass_classification_data.txt with 4 features, 3
classes and layers [4, 4, 3].
But when I try to use other input files with other features and classes
(from here for example:
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html)
then I get errors.

Example:
Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]):


with block size = 1:
ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation.
Decreasing step size to Infinity
ERROR LBFGS: Failure! Resetting history:
breeze.optimize.FirstOrderException: Line search failed
ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is
just poorly behaved?


with default block size = 128:
 java.lang.ArrayIndexOutOfBoundsException
  at java.lang.System.arraycopy(Native Method)
  at org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
  at org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
  at org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)



Even if I modify the sample_multiclass_classification_data.txt file (renaming
all 4th features to 5th) and run with layers [5, 5, 3], I also get the same
errors as for the file above.


So, to summarize:
I can't run training with the default block size and more than 4 features.
If I set the block size to 1, then something happens, but I get errors
from LBFGS.
It is reproducible with Spark 1.5.2 and from master branch on github (from
4-th July).

Has anybody already run into such behavior?
Is there a bug in MultilayerPerceptronClassifier, or am I using it incorrectly?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-MultilayerPerceptronClassifier-error-tp27279.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



log traces

2016-07-04 Thread Jean Georges Perrin
Hi,

I have installed Apache Spark via Maven.

How can I control the volume of logs it displays on my system? I tried different
locations for a log4j.properties, but none seems to work for me.

Thanks for help...
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Well, this will be apparent from the Environment tab of the GUI. It will show
how the job is actually running.

Jacek's point is correct. I suspect this is actually running in local mode,
as it looks like everything is being consumed on the master node.

HTH







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 July 2016 at 20:35, Jacek Laskowski  wrote:

> On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin 
> wrote:
>
>> Are you using a --master argument, or equivalent config, when calling
>> spark-submit?
>>
>> If you don't, it runs in standalone mode.
>>
>
> s/standalone/local[*]
>
> Jacek
>


pyspark aggregate vectors from onehotencoder

2016-07-04 Thread Sebastian Kuepers
hey,

what is the best practice to aggregate the vectors from OneHotEncoders in PySpark?

UDAFs are still not available in Python.
Is there any way to do it with Spark SQL?

Or do you have to switch to RDDs and do it with a reduceByKey, for example?

thanks,
sebastian
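One possible approach, if you do drop to RDDs as suggested above, is to sum the vectors per key with reduceByKey. A minimal sketch in plain Python, with ordinary lists standing in for the (normally sparse) encoder output; the keys and vectors here are illustrative only:

```python
def sum_vectors(a, b):
    # Element-wise sum of two equal-length dense vectors.
    return [x + y for x, y in zip(a, b)]

# (key, one-hot vector) pairs as they might come out of a OneHotEncoder,
# shown in dense form for illustration.
pairs = [
    ("user1", [1, 0, 0]),
    ("user1", [0, 1, 0]),
    ("user2", [0, 0, 1]),
]

# In Spark this would be: rdd.reduceByKey(sum_vectors)
# Here the same fold is done with a plain dict to show the arithmetic.
aggregated = {}
for key, vec in pairs:
    aggregated[key] = sum_vectors(aggregated[key], vec) if key in aggregated else vec

print(aggregated)  # {'user1': [1, 1, 0], 'user2': [0, 0, 1]}
```

With MLlib's SparseVector output you would convert to a summable representation (e.g. arrays) before reducing, since the vector classes do not define addition themselves.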







Disclaimer The information in this email and any attachments may contain 
proprietary and confidential information that is intended for the addressee(s) 
only. If you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution, retention or use of the contents of this 
information is prohibited. When addressed to our clients or vendors, any 
information contained in this e-mail or any attachments is subject to the terms 
and conditions in any governing contract. If you have received this e-mail in 
error, please immediately contact the sender and delete the e-mail.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jacek Laskowski
On Mon, Jul 4, 2016 at 8:36 PM, Mathieu Longtin 
wrote:

> Are you using a --master argument, or equivalent config, when calling
> spark-submit?
>
> If you don't, it runs in standalone mode.
>

s/standalone/local[*]

Jacek


Re: Enforcing shuffle hash join

2016-07-04 Thread Lalitha MV
Hi maropu,

Thanks for your reply.

Would it be possible to write a rule for this, to make it always pick
shuffle hash join over the other join implementations (i.e., sort merge and
broadcast)?

Is there any documentation demonstrating rule based transformation for
physical plan trees?

Thanks,
Lalitha

On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro 
wrote:

> Hi,
>
> No, spark has no hint for the hash join.
>
> // maropu
>
> On Fri, Jul 1, 2016 at 4:56 PM, Lalitha MV  wrote:
>
>> Hi,
>>
>> In order to force broadcast hash join, we can set
>> the spark.sql.autoBroadcastJoinThreshold config. Is there a way to enforce
>> shuffle hash join in spark sql?
>>
>>
>> Thanks,
>> Lalitha
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>



-- 
Regards,
Lalitha
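For reference, two settings influence this choice (a sketch only: the property names and their effect depend on the Spark version in use; spark.sql.join.preferSortMergeJoin is mentioned later in this thread and exists only in newer versions):

```properties
# spark-defaults.conf (version-dependent sketch, not a guaranteed hint)
# Disable broadcast hash join by removing the size threshold:
spark.sql.autoBroadcastJoinThreshold=-1
# Ask the planner not to prefer sort-merge join:
spark.sql.join.preferSortMergeJoin=false
```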


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
Are you using a --master argument, or equivalent config, when calling
spark-submit?

If you don't, it runs in standalone mode.
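A submission that pins the master explicitly would look like the following sketch, reusing the command and the spark.master value quoted elsewhere in this thread (host, port, and class names are the poster's own placeholders):

```shell
spark-submit \
  --master spark://spark.master:7077 \
  --driver-class-path spark/sqljdbc4.jar \
  --class DemoApp \
  SparkPOC.jar 10 4.3
```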

On Mon, Jul 4, 2016 at 2:27 PM Jakub Stransky  wrote:

> Hi Mich,
>
> sure that workers are mentioned in slaves file. I can see them in spark
> master UI and even after start they are "blocked" for this application but
> the cpu and memory consumption is close to nothing.
>
> Thanks
> Jakub
>
> On 4 July 2016 at 18:36, Mich Talebzadeh 
> wrote:
>
>> Silly question: have you added your workers to the conf/slaves file, and
>> have you started sbin/start-slaves.sh?
>>
>> On the master node, when you type jps, what do you see?
>>
>> The problem seems to be that the workers are ignored and Spark is
>> essentially running in local mode
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 July 2016 at 17:05, Jakub Stransky  wrote:
>>
>>> Hi Mich,
>>>
>>> I have set up spark default configuration in conf directory
>>> spark-defaults.conf where I specify master hence no need to put it in
>>> command line
>>> spark.master   spark://spark.master:7077
>>>
>>> the same applies to driver memory which has been increased to 4GB
>>>  and the same is for spark.executor.memory 12GB as machines have 16GB
>>>
>>> Jakub
>>>
>>>
>>>
>>>
>>> On 4 July 2016 at 17:44, Mich Talebzadeh 
>>> wrote:
>>>
 Hi Jakub,

 In standalone mode Spark does the resource management. Which version of
 Spark are you running?

 How do you define your SparkConf() parameters for example setMaster
 etc.

 From

 spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
 SparkPOC.jar 10 4.3

 I did not see any executor, memory allocation, so I assume you are
 allocating them somewhere else?

 HTH



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 4 July 2016 at 16:31, Jakub Stransky  wrote:

> Hello,
>
> I have a Spark cluster consisting of 4 nodes in standalone mode:
> a master plus 3 worker nodes, with available memory, CPUs, etc. configured.
>
> I have a Spark application which is essentially an MLlib pipeline for
> training a classifier, in this case RandomForest, but it could be a
> DecisionTree just for the sake of simplicity.
>
> But when I submit the application to the cluster via spark-submit, it
> runs out of memory. Even though the executors are "taken"/created in the
> cluster, they are essentially doing nothing (poor CPU and memory
> utilization) while the master seems to do all the work, which finally
> results in OOM.
>
> My submission is the following:
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I am submitting from the master node.
>
> By default it is running in client mode, in which the driver process is
> attached to the submitting shell.
>
> Do I need to set anything to make the MLlib algorithms parallelized and
> distributed as well, or is everything driven by the parallelism of the
> DataFrame with the input data?
>
> Essentially it seems that all the work is done on the master and the
> rest is idle.
> Any hints on what to check?
>
> Thx
> Jakub
>
>
>
>

>>>
>>>
>>> --
>>> Jakub Stransky
>>> cz.linkedin.com/in/jakubstransky
>>>
>>>
>>
>
>
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
>
> --
Mathieu Longtin
1-514-803-8977


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
OK, spark-submit by default starts its GUI at port 4040. You can change
that using --conf "spark.ui.port=" or any other port.

In the GUI, what do you see under the Environment and Executors tabs? Can
you send a snapshot?

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 July 2016 at 19:27, Jakub Stransky  wrote:

> Hi Mich,
>
> sure that workers are mentioned in slaves file. I can see them in spark
> master UI and even after start they are "blocked" for this application but
> the cpu and memory consumption is close to nothing.
>
> Thanks
> Jakub
>
> On 4 July 2016 at 18:36, Mich Talebzadeh 
> wrote:
>
>> Silly question: have you added your workers to the conf/slaves file, and
>> have you started sbin/start-slaves.sh?
>>
>> On the master node, when you type jps, what do you see?
>>
>> The problem seems to be that the workers are ignored and Spark is
>> essentially running in local mode
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 4 July 2016 at 17:05, Jakub Stransky  wrote:
>>
>>> Hi Mich,
>>>
>>> I have set up spark default configuration in conf directory
>>> spark-defaults.conf where I specify master hence no need to put it in
>>> command line
>>> spark.master   spark://spark.master:7077
>>>
>>> the same applies to driver memory which has been increased to 4GB
>>>  and the same is for spark.executor.memory 12GB as machines have 16GB
>>>
>>> Jakub
>>>
>>>
>>>
>>>
>>> On 4 July 2016 at 17:44, Mich Talebzadeh 
>>> wrote:
>>>
 Hi Jakub,

 In standalone mode Spark does the resource management. Which version of
 Spark are you running?

 How do you define your SparkConf() parameters for example setMaster
 etc.

 From

 spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
 SparkPOC.jar 10 4.3

 I did not see any executor, memory allocation, so I assume you are
 allocating them somewhere else?

 HTH



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 4 July 2016 at 16:31, Jakub Stransky  wrote:

> Hello,
>
> I have a Spark cluster consisting of 4 nodes in standalone mode:
> a master plus 3 worker nodes, with available memory, CPUs, etc. configured.
>
> I have a Spark application which is essentially an MLlib pipeline for
> training a classifier, in this case RandomForest, but it could be a
> DecisionTree just for the sake of simplicity.
>
> But when I submit the application to the cluster via spark-submit, it
> runs out of memory. Even though the executors are "taken"/created in the
> cluster, they are essentially doing nothing (poor CPU and memory
> utilization) while the master seems to do all the work, which finally
> results in OOM.
>
> My submission is the following:
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I am submitting from the master node.
>
> By default it is running in client mode, in which the driver process is
> attached to the submitting shell.
>
> Do I need to set anything to make the MLlib algorithms parallelized and
> distributed 

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich,

Sure, the workers are mentioned in the slaves file. I can see them in the
Spark master UI, and even after start they are "blocked" for this
application, but their CPU and memory consumption is close to nothing.

Thanks
Jakub

On 4 July 2016 at 18:36, Mich Talebzadeh  wrote:

> Silly question: have you added your workers to the conf/slaves file, and
> have you started sbin/start-slaves.sh?
>
> On the master node, when you type jps, what do you see?
>
> The problem seems to be that the workers are ignored and Spark is
> essentially running in local mode
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 4 July 2016 at 17:05, Jakub Stransky  wrote:
>
>> Hi Mich,
>>
>> I have set up spark default configuration in conf directory
>> spark-defaults.conf where I specify master hence no need to put it in
>> command line
>> spark.master   spark://spark.master:7077
>>
>> the same applies to driver memory which has been increased to 4GB
>>  and the same is for spark.executor.memory 12GB as machines have 16GB
>>
>> Jakub
>>
>>
>>
>>
>> On 4 July 2016 at 17:44, Mich Talebzadeh 
>> wrote:
>>
>>> Hi Jakub,
>>>
>>> In standalone mode Spark does the resource management. Which version of
>>> Spark are you running?
>>>
>>> How do you define your SparkConf() parameters for example setMaster etc.
>>>
>>> From
>>>
>>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>>> SparkPOC.jar 10 4.3
>>>
>>> I did not see any executor, memory allocation, so I assume you are
>>> allocating them somewhere else?
>>>
>>> HTH
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 4 July 2016 at 16:31, Jakub Stransky  wrote:
>>>
 Hello,

 I have a Spark cluster consisting of 4 nodes in standalone mode:
 a master plus 3 worker nodes, with available memory, CPUs, etc. configured.

 I have a Spark application which is essentially an MLlib pipeline for
 training a classifier, in this case RandomForest, but it could be a
 DecisionTree just for the sake of simplicity.

 But when I submit the application to the cluster via spark-submit, it
 runs out of memory. Even though the executors are "taken"/created in the
 cluster, they are essentially doing nothing (poor CPU and memory
 utilization) while the master seems to do all the work, which finally
 results in OOM.

 My submission is the following:
 spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
 SparkPOC.jar 10 4.3

 I am submitting from the master node.

 By default it is running in client mode, in which the driver process is
 attached to the submitting shell.

 Do I need to set anything to make the MLlib algorithms parallelized and
 distributed as well, or is everything driven by the parallelism of the
 DataFrame with the input data?

 Essentially it seems that all the work is done on the master and the
 rest is idle.
 Any hints on what to check?

 Thx
 Jakub




>>>
>>
>>
>> --
>> Jakub Stransky
>> cz.linkedin.com/in/jakubstransky
>>
>>
>


-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky


RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
I've found a workaround: I set up an HTTP server serving the jar, and pointed
to its HTTP URL in spark-submit.

Which brings me to ask: would it be a good option to allow spark-submit to
upload a local jar to the master, which the master could then serve via an
HTTP interface? The master already runs a web UI, so I imagine we could allow
it to receive jars and serve them as well. Perhaps an additional flag could
signify that the local jar should be uploaded in this manner. I'd be happy to
take a stab at it... but thoughts?
-Ashic.

From: as...@live.com
To: lohith.sam...@mphasis.com; user@spark.apache.org
Subject: RE: Cluster mode deployment from jar in S3
Date: Mon, 4 Jul 2016 11:30:31 +0100




Hi Lohith,Thanks for the response.
The S3 bucket does have access restrictions, but the instances in which the 
Spark master and workers run have an IAM role policy that allows them access to 
it. As such, we don't really configure the cli with credentials...the IAM roles 
take care of that. Is there a way to make Spark work the same way? Or should I 
get temporary credentials somehow (like 
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
 ), and use them to somehow submit the job? I guess I'll have to set it via 
environment variables; I can't put it in application code, as the issue is in 
downloading the jar from S3.
-Ashic.
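One way to hand the properties named in the error message to the driver at submit time is via Hadoop configuration passed on the command line. A sketch under stated assumptions: the spark.hadoop.* prefix forwards settings into the Hadoop configuration, but whether the s3:// connector in a given Hadoop build honours these properties, or can instead pick up IAM role credentials, depends on the Hadoop version; the class and jar path reuse the earlier example:

```shell
spark-submit \
  --master spark://master-ip:7077 \
  --deploy-mode cluster \
  --conf spark.hadoop.fs.s3.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
  --class foo \
  s3://bucket/dir/foo.jar
```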

From: lohith.sam...@mphasis.com
To: as...@live.com; user@spark.apache.org
Subject: RE: Cluster mode deployment from jar in S3
Date: Mon, 4 Jul 2016 09:50:50 +









Hi,
The
aws CLI already has your access key aid and secret access key when you 
initially configured it.
Is your s3 bucket without any access restrictions?
 
 

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
 

 


From: Ashic Mahtab [mailto:as...@live.com]


Sent: Monday, July 04, 2016 15.06

To: Apache Spark

Subject: RE: Cluster mode deployment from jar in S3


 

Sorry to do this...but... *bump*

 






From:
as...@live.com

To: user@spark.apache.org

Subject: Cluster mode deployment from jar in S3

Date: Fri, 1 Jul 2016 17:45:12 +0100

Hello,

I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs
using "--deploy-mode client"; however, using "--deploy-mode cluster" is proving
to be a challenge. I've tried this:


 


spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
s3://bucket/dir/foo.jar


 


When I do this, I get:



16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password
 (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or 
fs.s3.awsSecretAccessKey properties (respectively).


java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId
 or fs.s3.awsSecretAccessKey properties (respectively).


at 
org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)


at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)


at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)


at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)


at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)


at java.lang.reflect.Method.invoke(Method.java:498)


at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)


at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)



 


 


Now I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key ID and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:


 


aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar


 


I'm using Spark 1.6.2, and can't really think of what to do so that I can
submit the jar from S3 using cluster deploy mode. I've also tried simply
downloading the jar onto a node and spark-submitting that... that works in
client mode, but I get a not-found error when using cluster mode.


 


Any help will be appreciated.


 


Thanks,


Ashic.








Information transmitted by this e-mail is proprietary to Mphasis, its 
associated companies and/ or its customers and is intended 

for use only by the individual or entity to which it is addressed, and may 
contain information that is privileged, confidential or 

exempt from disclosure under applicable law. If you are not the intended 
recipient or it appears that this mail has been forwarded 

to you without proper authority, you are 

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Silly question: have you added your workers to the conf/slaves file, and have
you started sbin/start-slaves.sh?

On the master node, when you type jps, what do you see?

The problem seems to be that the workers are ignored and Spark is essentially
running in local mode.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 July 2016 at 17:05, Jakub Stransky  wrote:

> Hi Mich,
>
> I have set up spark default configuration in conf directory
> spark-defaults.conf where I specify master hence no need to put it in
> command line
> spark.master   spark://spark.master:7077
>
> the same applies to driver memory which has been increased to 4GB
>  and the same is for spark.executor.memory 12GB as machines have 16GB
>
> Jakub
>
>
>
>
> On 4 July 2016 at 17:44, Mich Talebzadeh 
> wrote:
>
>> Hi Jakub,
>>
>> In standalone mode Spark does the resource management. Which version of
>> Spark are you running?
>>
>> How do you define your SparkConf() parameters for example setMaster etc.
>>
>> From
>>
>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>> SparkPOC.jar 10 4.3
>>
>> I did not see any executor, memory allocation, so I assume you are
>> allocating them somewhere else?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>>
>>
>> On 4 July 2016 at 16:31, Jakub Stransky  wrote:
>>
>>> Hello,
>>>
>>> I have a Spark cluster consisting of 4 nodes in standalone mode: a master
>>> + 3 worker nodes with configured available memory, cpus, etc.
>>>
>>> I have a Spark application which is essentially an MLlib pipeline for
>>> training a classifier, in this case RandomForest, but it could be a
>>> DecisionTree just for the sake of simplicity.
>>>
>>> But when I submit the application to the cluster via spark-submit, it
>>> runs out of memory. Even though the executors are "taken"/created in
>>> the cluster, they are essentially doing nothing (poor cpu and memory
>>> utilization) while the master seems to do all the work, which finally
>>> results in OOM.
>>>
>>> My submission is the following:
>>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>>> SparkPOC.jar 10 4.3
>>>
>>> I am submitting from the master node.
>>>
>>> By default it runs in client mode, in which the driver process is
>>> attached to the spark-submit shell.
>>>
>>> Do I need to set up some settings to make MLlib algos parallelized and
>>> distributed as well or all is driven by parallel factor set on dataframe
>>> with input data?
>>>
>>> Essentially it seems that all work is just done on master and the rest
>>> is idle.
>>> Any hints what to check?
>>>
>>> Thx
>>> Jakub
>>>
>>>
>>>
>>>
>>
>
>
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky
>
>


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Mathieu,

there is no rocket science there. Essentially it creates a dataframe and then
calls fit from the ML pipeline. The thing which I do not understand is how the
parallelization is done in terms of the ML algorithm. Is it based on the
parallel factor of the dataframe? Because the ML algorithm doesn't offer such
a setting, afaik. There is only the notion of max depth, pruning etc., but
none concerns parallelization.
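For what it's worth, a common rule of thumb (an assumption on my part, not something the algorithm exposes) is to aim for roughly 2-3 tasks per core in the cluster, and to set that either by repartitioning the input DataFrame or via spark.default.parallelism at submit time:

```shell
# Rough sizing for a 3-worker cluster with 4 cores each (numbers are assumptions):
WORKERS=3
CORES_PER_WORKER=4
TASKS_PER_CORE=2
PARALLELISM=$(( WORKERS * CORES_PER_WORKER * TASKS_PER_CORE ))
echo "spark.default.parallelism=$PARALLELISM"

# Hypothetical submit line reusing the value (commented out; needs a cluster):
# spark-submit --conf spark.default.parallelism=$PARALLELISM \
#   --driver-class-path spark/sqljdbc4.jar --class DemoApp SparkPOC.jar 10 4.3
```

The tree trainers then parallelize over the partitions of that input, so the conf value (or an explicit repartition) is the knob.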

On 4 July 2016 at 17:51, Mathieu Longtin  wrote:

> When the driver is running out of memory, it usually means you're loading
> data in a non-parallel way (without using an RDD). Make sure anything that
> requires a non-trivial amount of memory is done by an RDD. Also, the default
> memory for everything is 1GB, which may not be enough for you.
>
> On Mon, Jul 4, 2016 at 11:44 AM Mich Talebzadeh 
> wrote:
>
>> Hi Jakub,
>>
>> In standalone mode Spark does the resource management. Which version of
>> Spark are you running?
>>
>> How do you define your SparkConf() parameters for example setMaster etc.
>>
>> From
>>
>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>> SparkPOC.jar 10 4.3
>>
>> I did not see any executor, memory allocation, so I assume you are
>> allocating them somewhere else?
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>>
>>
>>
>> On 4 July 2016 at 16:31, Jakub Stransky  wrote:
>>
>>> Hello,
>>>
>>> I have a Spark cluster consisting of 4 nodes in standalone mode: a master
>>> + 3 worker nodes with configured available memory, cpus, etc.
>>>
>>> I have a Spark application which is essentially an MLlib pipeline for
>>> training a classifier, in this case RandomForest, but it could be a
>>> DecisionTree just for the sake of simplicity.
>>>
>>> But when I submit the application to the cluster via spark-submit, it
>>> runs out of memory. Even though the executors are "taken"/created in
>>> the cluster, they are essentially doing nothing (poor cpu and memory
>>> utilization) while the master seems to do all the work, which finally
>>> results in OOM.
>>>
>>> My submission is the following:
>>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>>> SparkPOC.jar 10 4.3
>>>
>>> I am submitting from the master node.
>>>
>>> By default it runs in client mode, in which the driver process is
>>> attached to the spark-submit shell.
>>>
>>> Do I need to set up some settings to make MLlib algos parallelized and
>>> distributed as well or all is driven by parallel factor set on dataframe
>>> with input data?
>>>
>>> Essentially it seems that all work is just done on master and the rest
>>> is idle.
>>> Any hints what to check?
>>>
>>> Thx
>>> Jakub
>>>
>>>
>>>
>>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hi Mich,

I have set up spark default configuration in conf directory
spark-defaults.conf where I specify master hence no need to put it in
command line
spark.master   spark://spark.master:7077

the same applies to driver memory, which has been increased to 4GB,
and spark.executor.memory is set to 12GB, as the machines have 16GB
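Spelled out, a conf/spark-defaults.conf along these lines would match that description (a sketch; the master URL and sizes are taken from this thread):

```shell
# Write the described settings into a scratch copy of spark-defaults.conf
# (using a temp file here; on a real cluster this lives in $SPARK_HOME/conf).
CONF="$(mktemp)"
cat > "$CONF" <<'EOF'
spark.master           spark://spark.master:7077
spark.driver.memory    4g
spark.executor.memory  12g
EOF
cat "$CONF"
```

Anything set here is overridden by explicit --conf / --driver-memory / --executor-memory flags on the spark-submit command line.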

Jakub




On 4 July 2016 at 17:44, Mich Talebzadeh  wrote:

> Hi Jakub,
>
> In standalone mode Spark does the resource management. Which version of
> Spark are you running?
>
> How do you define your SparkConf() parameters for example setMaster etc.
>
> From
>
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I did not see any executor, memory allocation, so I assume you are
> allocating them somewhere else?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> On 4 July 2016 at 16:31, Jakub Stransky  wrote:
>
>> Hello,
>>
>> I have a Spark cluster consisting of 4 nodes in standalone mode: a master
>> + 3 worker nodes with configured available memory, cpus, etc.
>>
>> I have a Spark application which is essentially an MLlib pipeline for
>> training a classifier, in this case RandomForest, but it could be a
>> DecisionTree just for the sake of simplicity.
>>
>> But when I submit the application to the cluster via spark-submit, it
>> runs out of memory. Even though the executors are "taken"/created in
>> the cluster, they are essentially doing nothing (poor cpu and memory
>> utilization) while the master seems to do all the work, which finally
>> results in OOM.
>>
>> My submission is the following:
>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>> SparkPOC.jar 10 4.3
>>
>> I am submitting from the master node.
>>
>> By default it runs in client mode, in which the driver process is
>> attached to the spark-submit shell.
>>
>> Do I need to set up some settings to make MLlib algos parallelized and
>> distributed as well or all is driven by parallel factor set on dataframe
>> with input data?
>>
>> Essentially it seems that all work is just done on master and the rest is
>> idle.
>> Any hints what to check?
>>
>> Thx
>> Jakub
>>
>>
>>
>>
>


-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky


Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mathieu Longtin
When the driver is running out of memory, it usually means you're loading
data in a non-parallel way (without using an RDD). Make sure anything that
requires a non-trivial amount of memory is done by an RDD. Also, the default
memory for everything is 1GB, which may not be enough for you.
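One way to rule out the 1GB defaults is to make the memory settings explicit at submit time (a hedged sketch, not the original command; the sizes are placeholders):

```shell
# Explicit driver/executor memory so the 1g defaults cannot apply.
# Values are illustrative; this needs a running cluster to actually execute.
spark-submit \
  --driver-memory 4g \
  --executor-memory 12g \
  --driver-class-path spark/sqljdbc4.jar \
  --class DemoApp SparkPOC.jar 10 4.3
```

In client mode --driver-memory must be given on the command line (or in spark-defaults.conf), since by the time SparkConf is read the driver JVM has already started.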

On Mon, Jul 4, 2016 at 11:44 AM Mich Talebzadeh 
wrote:

> Hi Jakub,
>
> In standalone mode Spark does the resource management. Which version of
> Spark are you running?
>
> How do you define your SparkConf() parameters for example setMaster etc.
>
> From
>
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I did not see any executor, memory allocation, so I assume you are
> allocating them somewhere else?
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> On 4 July 2016 at 16:31, Jakub Stransky  wrote:
>
>> Hello,
>>
>> I have a Spark cluster consisting of 4 nodes in standalone mode: a master
>> + 3 worker nodes with configured available memory, cpus, etc.
>>
>> I have a Spark application which is essentially an MLlib pipeline for
>> training a classifier, in this case RandomForest, but it could be a
>> DecisionTree just for the sake of simplicity.
>>
>> But when I submit the application to the cluster via spark-submit, it
>> runs out of memory. Even though the executors are "taken"/created in
>> the cluster, they are essentially doing nothing (poor cpu and memory
>> utilization) while the master seems to do all the work, which finally
>> results in OOM.
>>
>> My submission is the following:
>> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
>> SparkPOC.jar 10 4.3
>>
>> I am submitting from the master node.
>>
>> By default it runs in client mode, in which the driver process is
>> attached to the spark-submit shell.
>>
>> Do I need to set up some settings to make MLlib algos parallelized and
>> distributed as well or all is driven by parallel factor set on dataframe
>> with input data?
>>
>> Essentially it seems that all work is just done on master and the rest is
>> idle.
>> Any hints what to check?
>>
>> Thx
>> Jakub
>>
>>
>>
>>
> --
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
Try to figure out what the env vars and arguments of the worker JVM and
Python process are. Maybe you'll get a clue.
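On Linux, one hedged way to do that inspection is via /proc (the pgrep pattern is an assumption about how the daemon shows up in ps):

```shell
# Find a pyspark daemon and dump its command line and Spark-related env vars.
# Requires a running daemon; PID discovery via pgrep, adjust the pattern.
PID="$(pgrep -f 'pyspark.daemon' | head -n 1)"
tr '\0' ' '  < "/proc/$PID/cmdline"; echo
tr '\0' '\n' < "/proc/$PID/environ" | grep -E 'SPARK|PYSPARK'
```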

On Mon, Jul 4, 2016 at 11:42 AM Mathieu Longtin 
wrote:

> I started with a download of 1.6.0. These days, we use a self compiled
> 1.6.2.
>
> On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav 
> wrote:
>
>> I am thinking of any possibilities as to why this could be happening. If
>> the cores are multi-threaded, should that affect the daemons? Your spark
>> was built from source code or downloaded as a binary, though that should
>> not technically change anything?
>>
>> On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin 
>> wrote:
>>
>>> 1.6.1.
>>>
>>> I have no idea. SPARK_WORKER_CORES should do the same.
>>>
>>> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav 
>>> wrote:
>>>
 Which version of Spark are you using? 1.6.1?

 Any ideas as to why it is not working in ours?

 On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin  wrote:

> 16.
>
> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
> wrote:
>
>> Hi,
>>
>> I tried what you suggested and started the slave using the following
>> command:
>>
>> start-slave.sh --cores 1 
>>
>> But it still seems to start as many pyspark daemons as the number of
>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>
>> When you said it helped you and limited it to 2 processes in your
>> cluster, how many cores did each machine have?
>>
>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <
>> math...@closetwork.org> wrote:
>>
>>> It depends on what you want to do:
>>>
>>> If, on any given server, you don't want Spark to use more than one
>>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>>> --cores=1
>>>
>>> If you have a bunch of servers dedicated to Spark, but you don't
>>> want a driver to use more than one core per server, then: 
>>> spark.executor.cores=1
>>> tells it not to use more than 1 core per server. However, it seems it 
>>> will
>>> start as many pyspark as there are cores, but maybe not use them.
>>>
>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
>>> wrote:
>>>
 Hi Mathieu,

 Isn't that the same as setting "spark.executor.cores" to 1? And how
 can I specify "--cores=1" from the application?

 On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
 math...@closetwork.org> wrote:

> When running the executor, put --cores=1. We use this and I only
> see 2 pyspark process, one seem to be the parent of the other and is 
> idle.
>
> In your case, are all pyspark process working?
>
> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>
>> Hi,
>>
>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>> application
>> is run, the load on the workers seems to go more than what was
>> given. When I
>> ran top, I noticed that there were too many Pyspark.daemons
>> processes
>> running. There was another mail thread regarding the same:
>>
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>
>> I followed what was mentioned there, i.e. reduced the number of
>> executor
>> cores and number of executors in one node to 1. But the number of
>> pyspark.daemons process is still not coming down. It looks like
>> initially
>> there is one Pyspark.daemons process and this in turn spawns as
>> many
>> pyspark.daemons processes as the number of cores in the machine.
>>
>> Any help is appreciated :)
>>
>> Thanks,
>> Ashwin Raaghav.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>



 --
 Regards,

 Ashwin Raaghav

>>> --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin 

Re: Spark application doesn't scale to worker nodes

2016-07-04 Thread Mich Talebzadeh
Hi Jakub,

In standalone mode Spark does the resource management. Which version of
Spark are you running?

How do you define your SparkConf() parameters for example setMaster etc.

From

spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
SparkPOC.jar 10 4.3

I did not see any executor, memory allocation, so I assume you are
allocating them somewhere else?

HTH



Dr Mich Talebzadeh






On 4 July 2016 at 16:31, Jakub Stransky  wrote:

> Hello,
>
> I have a Spark cluster consisting of 4 nodes in standalone mode: a master
> + 3 worker nodes with configured available memory, cpus, etc.
>
> I have a Spark application which is essentially an MLlib pipeline for
> training a classifier, in this case RandomForest, but it could be a
> DecisionTree just for the sake of simplicity.
>
> But when I submit the application to the cluster via spark-submit, it
> runs out of memory. Even though the executors are "taken"/created in
> the cluster, they are essentially doing nothing (poor cpu and memory
> utilization) while the master seems to do all the work, which finally
> results in OOM.
>
> My submission is the following:
> spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
> SparkPOC.jar 10 4.3
>
> I am submitting from the master node.
>
> By default it runs in client mode, in which the driver process is
> attached to the spark-submit shell.
>
> Do I need to set up some settings to make MLlib algos parallelized and
> distributed as well or all is driven by parallel factor set on dataframe
> with input data?
>
> Essentially it seems that all work is just done on master and the rest is
> idle.
> Any hints what to check?
>
> Thx
> Jakub
>
>
>
>


Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Thanks. I'll try that. Hopefully that should work.

On Mon, Jul 4, 2016 at 9:12 PM, Mathieu Longtin 
wrote:

> I started with a download of 1.6.0. These days, we use a self compiled
> 1.6.2.
>
> On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav 
> wrote:
>
>> I am thinking of any possibilities as to why this could be happening. If
>> the cores are multi-threaded, should that affect the daemons? Your spark
>> was built from source code or downloaded as a binary, though that should
>> not technically change anything?
>>
>> On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin 
>> wrote:
>>
>>> 1.6.1.
>>>
>>> I have no idea. SPARK_WORKER_CORES should do the same.
>>>
>>> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav 
>>> wrote:
>>>
 Which version of Spark are you using? 1.6.1?

 Any ideas as to why it is not working in ours?

 On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin  wrote:

> 16.
>
> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
> wrote:
>
>> Hi,
>>
>> I tried what you suggested and started the slave using the following
>> command:
>>
>> start-slave.sh --cores 1 
>>
>> But it still seems to start as many pyspark daemons as the number of
>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>
>> When you said it helped you and limited it to 2 processes in your
>> cluster, how many cores did each machine have?
>>
>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <
>> math...@closetwork.org> wrote:
>>
>>> It depends on what you want to do:
>>>
>>> If, on any given server, you don't want Spark to use more than one
>>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>>> --cores=1
>>>
>>> If you have a bunch of servers dedicated to Spark, but you don't
>>> want a driver to use more than one core per server, then: 
>>> spark.executor.cores=1
>>> tells it not to use more than 1 core per server. However, it seems it 
>>> will
>>> start as many pyspark as there are cores, but maybe not use them.
>>>
>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
>>> wrote:
>>>
 Hi Mathieu,

 Isn't that the same as setting "spark.executor.cores" to 1? And how
 can I specify "--cores=1" from the application?

 On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
 math...@closetwork.org> wrote:

> When running the executor, put --cores=1. We use this and I only
> see 2 pyspark process, one seem to be the parent of the other and is 
> idle.
>
> In your case, are all pyspark process working?
>
> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>
>> Hi,
>>
>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>> application
>> is run, the load on the workers seems to go more than what was
>> given. When I
>> ran top, I noticed that there were too many Pyspark.daemons
>> processes
>> running. There was another mail thread regarding the same:
>>
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>
>> I followed what was mentioned there, i.e. reduced the number of
>> executor
>> cores and number of executors in one node to 1. But the number of
>> pyspark.daemons process is still not coming down. It looks like
>> initially
>> there is one Pyspark.daemons process and this in turn spawns as
>> many
>> pyspark.daemons processes as the number of cores in the machine.
>>
>> Any help is appreciated :)
>>
>> Thanks,
>> Ashwin Raaghav.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>



 --
 Regards,

 Ashwin Raaghav

>>> --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin Raaghav
>>
> --
> Mathieu Longtin
> 1-514-803-8977

Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
I started with a download of 1.6.0. These days, we use a self compiled
1.6.2.

On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav  wrote:

> I am thinking of any possibilities as to why this could be happening. If
> the cores are multi-threaded, should that affect the daemons? Your spark
> was built from source code or downloaded as a binary, though that should
> not technically change anything?
>
> On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin 
> wrote:
>
>> 1.6.1.
>>
>> I have no idea. SPARK_WORKER_CORES should do the same.
>>
>> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav 
>> wrote:
>>
>>> Which version of Spark are you using? 1.6.1?
>>>
>>> Any ideas as to why it is not working in ours?
>>>
>>> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin 
>>> wrote:
>>>
 16.

 On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
 wrote:

> Hi,
>
> I tried what you suggested and started the slave using the following
> command:
>
> start-slave.sh --cores 1 
>
> But it still seems to start as many pyspark daemons as the number of
> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
> file by giving SPARK_WORKER_CORES=1 also didn't help.
>
> When you said it helped you and limited it to 2 processes in your
> cluster, how many cores did each machine have?
>
> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin <
> math...@closetwork.org> wrote:
>
>> It depends on what you want to do:
>>
>> If, on any given server, you don't want Spark to use more than one
>> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
>> --cores=1
>>
>> If you have a bunch of servers dedicated to Spark, but you don't want
>> a driver to use more than one core per server, then: 
>> spark.executor.cores=1
>> tells it not to use more than 1 core per server. However, it seems it 
>> will
>> start as many pyspark as there are cores, but maybe not use them.
>>
>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
>> wrote:
>>
>>> Hi Mathieu,
>>>
>>> Isn't that the same as setting "spark.executor.cores" to 1? And how
>>> can I specify "--cores=1" from the application?
>>>
>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
>>> math...@closetwork.org> wrote:
>>>
 When running the executor, put --cores=1. We use this and I only
 see 2 pyspark process, one seem to be the parent of the other and is 
 idle.

 In your case, are all pyspark process working?

 On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:

> Hi,
>
> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
> application
> is run, the load on the workers seems to go more than what was
> given. When I
> ran top, I noticed that there were too many Pyspark.daemons
> processes
> running. There was another mail thread regarding the same:
>
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>
> I followed what was mentioned there, i.e. reduced the number of
> executor
> cores and number of executors in one node to 1. But the number of
> pyspark.daemons process is still not coming down. It looks like
> initially
> there is one Pyspark.daemons process and this in turn spawns as
> many
> pyspark.daemons processes as the number of cores in the machine.
>
> Any help is appreciated :)
>
> Thanks,
> Ashwin Raaghav.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
 Mathieu Longtin
 1-514-803-8977

>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
 --
 Mathieu Longtin
 1-514-803-8977

>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
I am thinking of any possibilities as to why this could be happening. If
the cores are multi-threaded, should that affect the daemons? Was your Spark
built from source or downloaded as a binary? Though that should not
technically change anything.

On Mon, Jul 4, 2016 at 9:03 PM, Mathieu Longtin 
wrote:

> 1.6.1.
>
> I have no idea. SPARK_WORKER_CORES should do the same.
>
> On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav 
> wrote:
>
>> Which version of Spark are you using? 1.6.1?
>>
>> Any ideas as to why it is not working in ours?
>>
>> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin 
>> wrote:
>>
>>> 16.
>>>
>>> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
>>> wrote:
>>>
 Hi,

 I tried what you suggested and started the slave using the following
 command:

 start-slave.sh --cores 1 

 But it still seems to start as many pyspark daemons as the number of
 cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
 file by giving SPARK_WORKER_CORES=1 also didn't help.

 When you said it helped you and limited it to 2 processes in your
 cluster, how many cores did each machine have?

 On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin  wrote:

> It depends on what you want to do:
>
> If, on any given server, you don't want Spark to use more than one
> core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
> --cores=1
>
> If you have a bunch of servers dedicated to Spark, but you don't want
> a driver to use more than one core per server, then: 
> spark.executor.cores=1
> tells it not to use more than 1 core per server. However, it seems it will
> start as many pyspark as there are cores, but maybe not use them.
>
> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
> wrote:
>
>> Hi Mathieu,
>>
>> Isn't that the same as setting "spark.executor.cores" to 1? And how
>> can I specify "--cores=1" from the application?
>>
>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
>> math...@closetwork.org> wrote:
>>
>>> When running the executor, put --cores=1. We use this and I only see
>>> 2 pyspark process, one seem to be the parent of the other and is idle.
>>>
>>> In your case, are all pyspark process working?
>>>
>>> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>>>
 Hi,

 I am currently using PySpark 1.6.1 in my cluster. When a pyspark
 application
 is run, the load on the workers seems to go more than what was
 given. When I
 ran top, I noticed that there were too many Pyspark.daemons
 processes
 running. There was another mail thread regarding the same:


 https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E

 I followed what was mentioned there, i.e. reduced the number of
 executor
 cores and number of executors in one node to 1. But the number of
 pyspark.daemons process is still not coming down. It looks like
 initially
 there is one Pyspark.daemons process and this in turn spawns as many
 pyspark.daemons processes as the number of cores in the machine.

 Any help is appreciated :)

 Thanks,
 Ashwin Raaghav.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.


 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org

 --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin Raaghav
>>
> --
> Mathieu Longtin
> 1-514-803-8977
>



 --
 Regards,

 Ashwin Raaghav

>>> --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin Raaghav
>>
> --
> Mathieu Longtin
> 1-514-803-8977
>



-- 
Regards,

Ashwin Raaghav


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
1.6.1.

I have no idea. SPARK_WORKER_CORES should do the same.
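For the record, the two ways discussed in this thread to cap a worker at one core look like this (the master URL is an assumption):

```shell
# Option 1: in conf/spark-env.sh on every worker node.
export SPARK_WORKER_CORES=1

# Option 2: pass the limit when starting the worker by hand (commented out,
# since it needs a real master URL and a running master):
# "$SPARK_HOME"/sbin/start-slave.sh --cores 1 spark://spark.master:7077
```

Both cap how many cores the worker advertises to the master; as noted above, the pyspark.daemon processes may still fan out per physical core even when only one is actually scheduled.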

On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav  wrote:

> Which version of Spark are you using? 1.6.1?
>
> Any ideas as to why it is not working in ours?
>
> On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin 
> wrote:
>
>> 16.
>>
>> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
>> wrote:
>>
>>> Hi,
>>>
>>> I tried what you suggested and started the slave using the following
>>> command:
>>>
>>> start-slave.sh --cores 1 
>>>
>>> But it still seems to start as many pyspark daemons as the number of
>>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>>
>>> When you said it helped you and limited it to 2 processes in your
>>> cluster, how many cores did each machine have?
>>>
>>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin 
>>> wrote:
>>>
 It depends on what you want to do:

 If, on any given server, you don't want Spark to use more than one
 core, use this to start the workers: SPARK_HOME/sbin/start-slave.sh
 --cores=1

 If you have a bunch of servers dedicated to Spark, but you don't want a
 driver to use more than one core per server, then: spark.executor.cores=1
 tells it not to use more than 1 core per server. However, it seems it will
 start as many pyspark as there are cores, but maybe not use them.

 On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
 wrote:

> Hi Mathieu,
>
> Isn't that the same as setting "spark.executor.cores" to 1? And how
> can I specify "--cores=1" from the application?
>
> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin <
> math...@closetwork.org> wrote:
>
>> When running the executor, put --cores=1. We use this and I only see
>> 2 pyspark process, one seem to be the parent of the other and is idle.
>>
>> In your case, are all pyspark process working?
>>
>> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>>
>>> Hi,
>>>
>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>> application
>>> is run, the load on the workers seems to go more than what was
>>> given. When I
>>> ran top, I noticed that there were too many Pyspark.daemons processes
>>> running. There was another mail thread regarding the same:
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>
>>> I followed what was mentioned there, i.e. reduced the number of
>>> executor
>>> cores and number of executors in one node to 1. But the number of
>>> pyspark.daemons process is still not coming down. It looks like
>>> initially
>>> there is one Pyspark.daemons process and this in turn spawns as many
>>> pyspark.daemons processes as the number of cores in the machine.
>>>
>>> Any help is appreciated :)
>>>
>>> Thanks,
>>> Ashwin Raaghav.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
 --
 Mathieu Longtin
 1-514-803-8977

>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Spark application doesn't scale to worker nodes

2016-07-04 Thread Jakub Stransky
Hello,

I have a spark cluster consisting of 4 nodes in a standalone mode, master +
3 workers nodes with configured available memory and cpus etc.

I have a Spark application which is essentially an MLlib pipeline for
training a classifier, in this case a RandomForest, but it could be a
DecisionTree just for the sake of simplicity.

But when I submit the Spark application to the cluster via spark-submit, it
runs out of memory. Even though the executors are "taken"/created in
the cluster, they are essentially doing nothing (poor CPU and memory
utilization) while the master seems to do all the work, which finally
results in OOM.

My submission is following:
spark-submit --driver-class-path spark/sqljdbc4.jar --class DemoApp
SparkPOC.jar 10 4.3
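As a rough sketch (the master URL and all memory/core/parallelism numbers below are assumptions for illustration, not recommendations), spelling out the executor resources at submit time can rule out a default-configuration cause; MLlib itself distributes work across the input DataFrame's partitions:

```shell
# Hypothetical variant of the submission above with executor resources pinned
# explicitly; the master URL and all sizes are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --driver-class-path spark/sqljdbc4.jar \
  --class DemoApp \
  --executor-memory 4G \
  --executor-cores 2 \
  --total-executor-cores 6 \
  --conf spark.default.parallelism=24 \
  SparkPOC.jar 10 4.3
```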

I am submitting from the master node.

By default it is running in client mode, in which the driver process is
attached to the shell that ran spark-submit.

Do I need to set some configuration to make the MLlib algorithms parallelized
and distributed as well, or is it all driven by the parallelism factor set on
the DataFrame with the input data?

Essentially it seems that all the work is done on the master and the rest are
idle.
Any hints on what to check?

Thx
Jakub


Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Which version of Spark are you using? 1.6.1?

Any ideas as to why it is not working in ours?

On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin 
wrote:

> 16.
>
> On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav 
> wrote:
>
>> Hi,
>>
>> I tried what you suggested and started the slave using the following
>> command:
>>
>> start-slave.sh --cores 1 
>>
>> But it still seems to start as many pyspark daemons as the number of
>> cores in the node (1 parent and 3 workers). Limiting it via spark-env.sh
>> file by giving SPARK_WORKER_CORES=1 also didn't help.
>>
>> When you said it helped you and limited it to 2 processes in your
>> cluster, how many cores did each machine have?
>>
>> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin 
>> wrote:
>>
>>> It depends on what you want to do:
>>>
>>> If, on any given server, you don't want Spark to use more than one core,
>>> use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1
>>>
>>> If you have a bunch of servers dedicated to Spark, but you don't want a
>>> driver to use more than one core per server, then: spark.executor.cores=1
>>> tells it not to use more than 1 core per server. However, it seems it will
>>> start as many pyspark as there are cores, but maybe not use them.
>>>
>>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
>>> wrote:
>>>
 Hi Mathieu,

 Isn't that the same as setting "spark.executor.cores" to 1? And how can
 I specify "--cores=1" from the application?

 On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin  wrote:

> When running the executor, put --cores=1. We use this and I only see 2
> pyspark process, one seem to be the parent of the other and is idle.
>
> In your case, are all pyspark process working?
>
> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>
>> Hi,
>>
>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>> application
>> is run, the load on the workers seems to go more than what was given.
>> When I
>> ran top, I noticed that there were too many Pyspark.daemons processes
>> running. There was another mail thread regarding the same:
>>
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>
>> I followed what was mentioned there, i.e. reduced the number of
>> executor
>> cores and number of executors in one node to 1. But the number of
>> pyspark.daemons process is still not coming down. It looks like
>> initially
>> there is one Pyspark.daemons process and this in turn spawns as many
>> pyspark.daemons processes as the number of cores in the machine.
>>
>> Any help is appreciated :)
>>
>> Thanks,
>> Ashwin Raaghav.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>



 --
 Regards,

 Ashwin Raaghav

>>> --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin Raaghav
>>
> --
> Mathieu Longtin
> 1-514-803-8977
>



-- 
Regards,

Ashwin Raaghav


How to handle update/deletion in Structured Streaming?

2016-07-04 Thread Arnaud Bailly
Hello,

I am interested in using the new Structured Streaming feature of Spark SQL
and am currently doing some experiments on code at HEAD. I would like to
have a better understanding of how deletion should be handled in a
structured streaming setting.

Given some incremental query computing an arbitrary aggregation over some
dataset, inserting new values is somewhat obvious: simply update the
aggregate computation tree with whatever new values are added to input
datasets/datastreams. But things are not so obvious for updates and
deletions: do they have a representation in the input datastreams? If I
have a query that aggregates some value over some key, and I delete all
instances of that key, I would expect the query to output a result removing
the key's aggregated value. The same is true for updates...

Thanks for any insights you might want to share.

Regards,
-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
16.

On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav  wrote:

> Hi,
>
> I tried what you suggested and started the slave using the following
> command:
>
> start-slave.sh --cores 1 
>
> But it still seems to start as many pyspark daemons as the number of cores
> in the node (1 parent and 3 workers). Limiting it via spark-env.sh file by
> giving SPARK_WORKER_CORES=1 also didn't help.
>
> When you said it helped you and limited it to 2 processes in your cluster,
> how many cores did each machine have?
>
> On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin 
> wrote:
>
>> It depends on what you want to do:
>>
>> If, on any given server, you don't want Spark to use more than one core,
>> use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1
>>
>> If you have a bunch of servers dedicated to Spark, but you don't want a
>> driver to use more than one core per server, then: spark.executor.cores=1
>> tells it not to use more than 1 core per server. However, it seems it will
>> start as many pyspark as there are cores, but maybe not use them.
>>
>> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
>> wrote:
>>
>>> Hi Mathieu,
>>>
>>> Isn't that the same as setting "spark.executor.cores" to 1? And how can
>>> I specify "--cores=1" from the application?
>>>
>>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin 
>>> wrote:
>>>
 When running the executor, put --cores=1. We use this and I only see 2
 pyspark process, one seem to be the parent of the other and is idle.

 In your case, are all pyspark process working?

 On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:

> Hi,
>
> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
> application
> is run, the load on the workers seems to go more than what was given.
> When I
> ran top, I noticed that there were too many Pyspark.daemons processes
> running. There was another mail thread regarding the same:
>
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>
> I followed what was mentioned there, i.e. reduced the number of
> executor
> cores and number of executors in one node to 1. But the number of
> pyspark.daemons process is still not coming down. It looks like
> initially
> there is one Pyspark.daemons process and this in turn spawns as many
> pyspark.daemons processes as the number of cores in the machine.
>
> Any help is appreciated :)
>
> Thanks,
> Ashwin Raaghav.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
 Mathieu Longtin
 1-514-803-8977

>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Ashwin Raaghav
>>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Hi,

I tried what you suggested and started the slave using the following
command:

start-slave.sh --cores 1 

But it still seems to start as many pyspark daemons as the number of cores
in the node (1 parent and 3 workers). Limiting it via the spark-env.sh file by
setting SPARK_WORKER_CORES=1 also didn't help.

When you said it helped you and limited it to 2 processes in your cluster,
how many cores did each machine have?

On Mon, Jul 4, 2016 at 8:22 PM, Mathieu Longtin 
wrote:

> It depends on what you want to do:
>
> If, on any given server, you don't want Spark to use more than one core,
> use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1
>
> If you have a bunch of servers dedicated to Spark, but you don't want a
> driver to use more than one core per server, then: spark.executor.cores=1
> tells it not to use more than 1 core per server. However, it seems it will
> start as many pyspark as there are cores, but maybe not use them.
>
> On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav 
> wrote:
>
>> Hi Mathieu,
>>
>> Isn't that the same as setting "spark.executor.cores" to 1? And how can I
>> specify "--cores=1" from the application?
>>
>> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin 
>> wrote:
>>
>>> When running the executor, put --cores=1. We use this and I only see 2
>>> pyspark process, one seem to be the parent of the other and is idle.
>>>
>>> In your case, are all pyspark process working?
>>>
>>> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>>>
 Hi,

 I am currently using PySpark 1.6.1 in my cluster. When a pyspark
 application
 is run, the load on the workers seems to go more than what was given.
 When I
 ran top, I noticed that there were too many Pyspark.daemons processes
 running. There was another mail thread regarding the same:


 https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E

 I followed what was mentioned there, i.e. reduced the number of executor
 cores and number of executors in one node to 1. But the number of
 pyspark.daemons process is still not coming down. It looks like
 initially
 there is one Pyspark.daemons process and this in turn spawns as many
 pyspark.daemons processes as the number of cores in the machine.

 Any help is appreciated :)

 Thanks,
 Ashwin Raaghav.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org

 --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Ashwin Raaghav
>>
> --
> Mathieu Longtin
> 1-514-803-8977
>



-- 
Regards,

Ashwin Raaghav


Re: Spark driver assigning splits to incorrect workers

2016-07-04 Thread Raajen Patel
Hi Ted,

Perhaps this might help? Thanks for your response. I am trying to
access/read binary files stored over a series of servers.

Line used to build RDD:
val BIN_pairRDD: RDD[(BIN_Key, BIN_Value)] =
  spark.newAPIHadoopFile("not.used", classOf[BIN_InputFormat],
    classOf[BIN_Key], classOf[BIN_Value], config)

In order to support this, we have the following custom classes:
- BIN_Key and BIN_Value as the paired entry for the RDD
- BIN_RecordReader and BIN_FileSplit to handle the special splits
- BIN_FileSplit overrides getLocations() and getLocationInfo(), and we have
verified that the right IP address is being sent to Spark.
- BIN_InputFormat queries a database for details about every split to be
created; as in, which file to read and the IP address where that file is
local.

When it works:
- No problems running a local job
- No problems running in a cluster when there is 1 computer as Master and
another computer with 3 workers along with the files to process.

When it fails:
- When running in a cluster with multiple workers and files spread across
multiple computers. Jobs are not assigned to the nodes where the files are
local.

Thanks,
Raajen


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
It depends on what you want to do:

If, on any given server, you don't want Spark to use more than one core,
use this to start the workers: SPARK_HOME/sbin/start-slave.sh --cores=1

If you have a bunch of servers dedicated to Spark, but you don't want a
driver to use more than one core per server, then spark.executor.cores=1
tells it not to use more than 1 core per server. However, it seems it will
start as many pyspark daemons as there are cores, but maybe not use them.
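The two approaches above can be sketched as shell commands (the master URL and the application name are assumptions for illustration):

```shell
# Option 1: cap the standalone worker itself at one core when starting it.
$SPARK_HOME/sbin/start-slave.sh --cores 1 spark://master-host:7077

# Option 2: leave the worker alone and cap each application's executors
# instead, at submit time.
spark-submit --master spark://master-host:7077 \
  --conf spark.executor.cores=1 \
  my_app.py   # placeholder application
```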

On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav  wrote:

> Hi Mathieu,
>
> Isn't that the same as setting "spark.executor.cores" to 1? And how can I
> specify "--cores=1" from the application?
>
> On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin 
> wrote:
>
>> When running the executor, put --cores=1. We use this and I only see 2
>> pyspark process, one seem to be the parent of the other and is idle.
>>
>> In your case, are all pyspark process working?
>>
>> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>>
>>> Hi,
>>>
>>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>>> application
>>> is run, the load on the workers seems to go more than what was given.
>>> When I
>>> ran top, I noticed that there were too many Pyspark.daemons processes
>>> running. There was another mail thread regarding the same:
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>>
>>> I followed what was mentioned there, i.e. reduced the number of executor
>>> cores and number of executors in one node to 1. But the number of
>>> pyspark.daemons process is still not coming down. It looks like initially
>>> there is one Pyspark.daemons process and this in turn spawns as many
>>> pyspark.daemons processes as the number of cores in the machine.
>>>
>>> Any help is appreciated :)
>>>
>>> Thanks,
>>> Ashwin Raaghav.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
>
>
> --
> Regards,
>
> Ashwin Raaghav
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Hi Mathieu,

Isn't that the same as setting "spark.executor.cores" to 1? And how can I
specify "--cores=1" from the application?

On Mon, Jul 4, 2016 at 8:06 PM, Mathieu Longtin 
wrote:

> When running the executor, put --cores=1. We use this and I only see 2
> pyspark process, one seem to be the parent of the other and is idle.
>
> In your case, are all pyspark process working?
>
> On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:
>
>> Hi,
>>
>> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
>> application
>> is run, the load on the workers seems to go more than what was given.
>> When I
>> ran top, I noticed that there were too many Pyspark.daemons processes
>> running. There was another mail thread regarding the same:
>>
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>>
>> I followed what was mentioned there, i.e. reduced the number of executor
>> cores and number of executors in one node to 1. But the number of
>> pyspark.daemons process is still not coming down. It looks like initially
>> there is one Pyspark.daemons process and this in turn spawns as many
>> pyspark.daemons processes as the number of cores in the machine.
>>
>> Any help is appreciated :)
>>
>> Thanks,
>> Ashwin Raaghav.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>



-- 
Regards,

Ashwin Raaghav


Re: Limiting Pyspark.daemons

2016-07-04 Thread Mathieu Longtin
When running the executor, put --cores=1. We use this, and I only see 2
pyspark processes; one seems to be the parent of the other and is idle.

In your case, are all pyspark processes working?

On Mon, Jul 4, 2016 at 3:15 AM ar7  wrote:

> Hi,
>
> I am currently using PySpark 1.6.1 in my cluster. When a pyspark
> application
> is run, the load on the workers seems to go more than what was given. When
> I
> ran top, I noticed that there were too many Pyspark.daemons processes
> running. There was another mail thread regarding the same:
>
>
> https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E
>
> I followed what was mentioned there, i.e. reduced the number of executor
> cores and number of executors in one node to 1. But the number of
> pyspark.daemons process is still not coming down. It looks like initially
> there is one Pyspark.daemons process and this in turn spawns as many
> pyspark.daemons processes as the number of cores in the machine.
>
> Any help is appreciated :)
>
> Thanks,
> Ashwin Raaghav.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
Mathieu Longtin
1-514-803-8977


Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-04 Thread Pedro Rodriguez
Just realized I had been replying back to only Takeshi.

Thanks for the tip, as it got me on the right track. I'm running into an issue
with private [spark] methods though. It looks like the input metrics start out
as None and are not initialized (verified by throwing a new Exception on the
pattern match cases for when it is None and when it is not). It looks like
NewHadoopRDD calls getInputMetricsForReadMethod, which sets _inputMetrics if it
is None, but unfortunately it is private [spark]. Is there a way for external
RDDs to access this method, or to somehow initialize _inputMetrics in 1.6.x (it
looks like 2.0 makes more of this API public)?

Using reflection I was able to implement it mimicking the NewHadoopRDD code, 
but if possible would like to avoid using reflection. Below is the source code 
for the method that works.

RDD code: 
https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/S3RDD.scala#L100-L105
Reflection code: 
https://github.com/EntilZha/spark-s3/blob/9e632f2a71fba2858df748ed43f0dbb5dae52a83/src/main/scala/io/entilzha/spark/s3/PrivateMethodUtil.scala

Thanks,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn

On July 3, 2016 at 10:31:30 PM, Takeshi Yamamuro (linguin@gmail.com) wrote:

How about using `SparkListener`?
You can collect IO statistics thru TaskMetrics#inputMetrics by yourself.

// maropu

On Mon, Jul 4, 2016 at 11:46 AM, Pedro Rodriguez  
wrote:
Hi All,

I noticed on some Spark jobs it shows you input/output read size. I am 
implementing a custom RDD which reads files and would like to report these 
metrics to Spark since they are available to me.

I looked through the RDD source code and a couple different implementations and 
the best I could find were some Hadoop metrics. Is there a way to simply report 
the number of bytes a partition read so Spark can put it on the UI?

Thanks,
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
UC Berkeley AMPLab Alumni

pedrorodriguez.io | 909-353-4423
github.com/EntilZha | LinkedIn



--
---
Takeshi Yamamuro


Specifying Fixed Duration (Spot Block) for AWS Spark EC2 Cluster

2016-07-04 Thread nsharkey
When I spin up an AWS Spark cluster per the Spark EC2 script:

According to AWS: 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-requests.html#fixed-duration-spot-instances

there is a way of reserving a fixed-duration Spot cluster through the AWS CLI
and the web portal, but I can't find anything that works with the Spark
script. 

My current (working) script will allow me to get Spot requests but I can't
specify a duration: 
./spark-ec2 \
--key-pair= \
--identity-file= \
--instance-type=r3.8xlarge \
-s 2  \
--spot-price=0.75 \
--block-duration-minutes 120 \
launch spark_rstudio_h2o_cluster

I've tried --block-duration-minutes and --spot-block with no success. 

Thanks. 
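For comparison, the underlying AWS CLI call does accept a block duration; a sketch (the AMI id and the launch-specification values are placeholders) — the spark-ec2 script presumably just never forwards such a flag:

```shell
# Hypothetical direct Spot-block request via the AWS CLI; ami-xxxxxxxx, the
# instance type, and the price are placeholders.
aws ec2 request-spot-instances \
  --spot-price "0.75" \
  --block-duration-minutes 120 \
  --instance-count 2 \
  --launch-specification '{"ImageId": "ami-xxxxxxxx", "InstanceType": "r3.8xlarge"}'
```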



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Specifying-Fixed-Duration-Spot-Block-for-AWS-Spark-EC2-Cluster-tp27278.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to spin up Kafka using docker and use for Spark Streaming Integration tests

2016-07-04 Thread Lars Albertsson
I created such a setup for a client a few months ago. It is pretty
straightforward, but it can take some work to get all the wires
connected.

I suggest that you start with the spotify/kafka
(https://github.com/spotify/docker-kafka) Docker image, since it
includes a bundled zookeeper. The alternative would be to spin up a
separate Zookeeper Docker container and connect them, but for testing
purposes, it would make the setup more complex.

You'll need to inform Kafka about the external address it exposes by
setting ADVERTISED_HOST to the output of "docker-machine ip" (on Mac)
or the address printed by "ip addr show docker0" (Linux). I also
suggest setting
AUTO_CREATE_TOPICS to true.
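A minimal sketch of starting that container from the shell (the environment variable names follow the spotify/kafka image; the address-detection one-liner is an assumption for Linux):

```shell
# Detect the docker0 address (Linux) and start the bundled Kafka+Zookeeper
# container; ADVERTISED_HOST must be reachable from where the tests run.
ADVERTISED_HOST=$(ip addr show docker0 | awk '/inet /{sub(/\/.*/, "", $2); print $2; exit}')
docker run -d --name test-kafka \
  -p 2181:2181 -p 9092:9092 \
  -e ADVERTISED_HOST="$ADVERTISED_HOST" \
  -e ADVERTISED_PORT=9092 \
  -e AUTO_CREATE_TOPICS=true \
  spotify/kafka
```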

You can choose to run your Spark Streaming application under test
(SUT) and your test harness also in Docker containers, or directly on
your host.

In the former case, it is easiest to set up a Docker Compose file
linking the harness and SUT to Kafka. This variant provides better
isolation, and might integrate better if you have existing similar
test frameworks.

If you want to run the harness and SUT outside Docker, I suggest that
you build your harness with a standard test framework, e.g. scalatest
or JUnit, and run both harness and SUT in the same JVM. In this case,
you put code to bring up the Kafka Docker container in test framework
setup methods. This test strategy integrates better with IDEs and
build tools (mvn/sbt/gradle), since they will run (and debug) your
tests without any special integration. I therefore prefer this
strategy.


What is the output of your application? If it is messages on a
different Kafka topic, the test harness can merely subscribe and
verify output. If you emit output to a database, you'll need another
Docker container, integrated with Docker Compose. If you are emitting
database entries, your test oracle will need to frequently poll the
database for the expected records, with a timeout in order not to hang
on failing tests.

I hope this is comprehensible. Let me know if you have followup questions.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
Calendar: https://goo.gl/6FBtlS



On Thu, Jun 30, 2016 at 8:19 PM, SRK  wrote:
> Hi,
>
> I need to do integration tests using Spark Streaming. My idea is to spin up
> kafka using docker locally and use it to feed the stream to my Streaming
> Job. Any suggestions on how to do this would be of great help.
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-spin-up-Kafka-using-docker-and-use-for-Spark-Streaming-Integration-tests-tp27252.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: java.io.FileNotFoundException

2016-07-04 Thread Jacek Laskowski
Can you share some stats from Web UI just before the failure? Any earlier
errors before FNFE?

Jacek
On 4 Jul 2016 12:34 p.m., "kishore kumar"  wrote:

> @jacek: It is running on yarn-client mode, our code don't support running
> in yarn-cluster mode and the job is running for around an hour and giving
> the exception.
>
> @karhi: yarn application status is successful, resourcemanager logs did
> not give any failure info except
> 16/07/04 00:27:57 INFO executor.CoarseGrainedExecutorBackend: Driver
> commanded a shutdown
> 16/07/04 00:27:57 INFO storage.MemoryStore: MemoryStore cleared
> 16/07/04 00:27:57 INFO storage.BlockManager: BlockManager stopped
> 16/07/04 00:27:57 WARN executor.CoarseGrainedExecutorBackend: An unknown (
> slave1.domain.com:56055) driver disconnected.
> 16/07/04 00:27:57 ERROR executor.CoarseGrainedExecutorBackend: Driver
> 173.36.88.26:56055 disassociated! Shutting down.
> 16/07/04 00:27:57 INFO util.ShutdownHookManager: Shutdown hook called
> 16/07/04 00:27:57 INFO util.ShutdownHookManager: Deleting directory
> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/user/appcache/application_1467474162580_29353/spark-9c0bfccc-74c3-4541-a2fd-19101e47b49a
> End of LogType:stderr
>
>
> On Mon, Jul 4, 2016 at 3:20 PM, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> You seem to be using yarn. Is this cluster or client deploy mode? Have
>> you seen any other exceptions before? How long did the application run
>> before the exception?
>>
>> Regards,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Jul 4, 2016 at 10:57 AM, kishore kumar 
>> wrote:
>> > We've upgraded spark version from 1.2 to 1.6 still the same problem,
>> >
>> > Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due
>> > to stage failure: Task 286 in stage
>> > 2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage
>> 2397.0
>> > (TID 314416, salve-06.domain.com): java.io.FileNotFoundException:
>> > /opt/mapr/tmp/h
>> >
>> adoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.inde
>> > x.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)
>> >
>> > Kindly help me to get rid from this.
>> >
>> > On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Could anyone help me about this error ? why this error comes ?
>> >>
>> >> Thanks,
>> >> KishoreKuamr.
>> >>
>> >> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
>> >> wrote:
>> >>>
>> >>> Hi Jeff Zhang,
>> >>>
>> >>> Thanks for response, could you explain me why this error occurs ?
>> >>>
>> >>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:
>> 
>>  One quick solution is to use spark 1.6.1.
>> 
>>  On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar > >
>>  wrote:
>> >
>> > Could anyone help me on this issue ?
>> >
>> > On Tue, May 31, 2016 at 8:00 PM, kishore kumar <
>> akishore...@gmail.com>
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> We installed spark1.2.1 in single node, running a job in
>> yarn-client
>> >> mode on yarn which loads data into hbase and elasticsearch,
>> >>
>> >> the error which we are encountering is
>> >> Exception in thread "main" org.apache.spark.SparkException: Job
>> >> aborted due to stage failure: Task 38 in stage 26800.0 failed 4
>> times, most
>> >> recent failure: Lost task 38.3 in stage 26800.0 (TID 4990082,
>> >> hdprd-c01-r04-03): java.io.FileNotFoundException:
>> >>
>> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
>> >> (No such file or directory)
>> >>
>> >> any idea about this error ?
>> >> --
>> >> Thanks,
>> >> Kishore.
>> >
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Kishore.
>> 
>> 
>> 
>> 
>>  --
>>  Best Regards
>> 
>>  Jeff Zhang
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Thanks,
>> >>> Kishore.
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks,
>> >> Kishore.
>> >
>> >
>> >
>> >
>> > --
>> > Thanks,
>> > Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>


Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
@jacek: It is running in yarn-client mode; our code doesn't support running
in yarn-cluster mode, and the job runs for around an hour before giving
the exception.

@karhi: yarn application status is successful, resourcemanager logs did not
give any failure info except
16/07/04 00:27:57 INFO executor.CoarseGrainedExecutorBackend: Driver
commanded a shutdown
16/07/04 00:27:57 INFO storage.MemoryStore: MemoryStore cleared
16/07/04 00:27:57 INFO storage.BlockManager: BlockManager stopped
16/07/04 00:27:57 WARN executor.CoarseGrainedExecutorBackend: An unknown (
slave1.domain.com:56055) driver disconnected.
16/07/04 00:27:57 ERROR executor.CoarseGrainedExecutorBackend: Driver
173.36.88.26:56055 disassociated! Shutting down.
16/07/04 00:27:57 INFO util.ShutdownHookManager: Shutdown hook called
16/07/04 00:27:57 INFO util.ShutdownHookManager: Deleting directory
/opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/user/appcache/application_1467474162580_29353/spark-9c0bfccc-74c3-4541-a2fd-19101e47b49a
End of LogType:stderr


On Mon, Jul 4, 2016 at 3:20 PM, Jacek Laskowski  wrote:

> Hi,
>
> You seem to be using yarn. Is this cluster or client deploy mode? Have
> you seen any other exceptions before? How long did the application run
> before the exception?
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Jul 4, 2016 at 10:57 AM, kishore kumar 
> wrote:
> > We've upgraded spark version from 1.2 to 1.6 still the same problem,
> >
> > Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due
> > to stage failure: Task 286 in stage
> > 2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage
> 2397.0
> > (TID 314416, salve-06.domain.com): java.io.FileNotFoundException:
> > /opt/mapr/tmp/h
> >
> adoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.inde
> > x.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)
> >
> > Kindly help me to get rid from this.
> >
> > On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar 
> wrote:
> >>
> >> Hi,
> >>
> >> Could anyone help me about this error ? why this error comes ?
> >>
> >> Thanks,
> >> KishoreKuamr.
> >>
> >> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
> >> wrote:
> >>>
> >>> Hi Jeff Zhang,
> >>>
> >>> Thanks for response, could you explain me why this error occurs ?
> >>>
> >>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:
> 
>  One quick solution is to use spark 1.6.1.
> 
>  On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar 
>  wrote:
> >
> > Could anyone help me on this issue ?
> >
> > On Tue, May 31, 2016 at 8:00 PM, kishore kumar <
> akishore...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> We installed spark1.2.1 in single node, running a job in yarn-client
> >> mode on yarn which loads data into hbase and elasticsearch,
> >>
> >> the error which we are encountering is
> >> Exception in thread "main" org.apache.spark.SparkException: Job
> >> aborted due to stage failure: Task 38 in stage 26800.0 failed 4
> times, most
> >> recent failure: Lost task 38.3 in stage 26800.0 (TID 4990082,
> >> hdprd-c01-r04-03): java.io.FileNotFoundException:
> >>
> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
> >> (No such file or directory)
> >>
> >> any idea about this error ?
> >> --
> >> Thanks,
> >> Kishore.
> >
> >
> >
> >
> > --
> > Thanks,
> > Kishore.
> 
> 
> 
> 
>  --
>  Best Regards
> 
>  Jeff Zhang
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> Kishore.
> >>
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Kishore.
> >
> >
> >
> >
> > --
> > Thanks,
> > Kishore.
>



-- 
Thanks,
Kishore.


RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Hi Lohith,

Thanks for the response.

The S3 bucket does have access restrictions, but the instances in which the
Spark master and workers run have an IAM role policy that allows them access to
it. As such, we don't really configure the CLI with credentials; the IAM roles
take care of that. Is there a way to make Spark work the same way? Or should I
get temporary credentials somehow (like
http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html
), and use them to submit the job? I guess I'll have to set them via
environment variables; I can't put them in application code, as the issue is in
downloading the jar from S3.

-Ashic.
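A possible workaround (not from the thread, so treat it as an untested sketch): spark-submit can forward arbitrary Hadoop properties using the spark.hadoop. prefix, and the property names to set are exactly the ones named in the error below (fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey). The bucket and jar paths are the thread's placeholders, and whether the standalone worker honours these properties during the jar fetch in Spark 1.6 would need verifying:

```shell
# Obtain short-lived credentials from the instance's IAM role / STS
# (e.g. via `aws sts get-session-token`), then forward them to Spark as
# Hadoop properties -- the property names come from the error message.
export AWS_ACCESS_KEY_ID=...        # placeholder
export AWS_SECRET_ACCESS_KEY=...    # placeholder

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster \
  --conf spark.hadoop.fs.s3.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
  s3://bucket/dir/foo.jar
```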

From: lohith.sam...@mphasis.com
To: as...@live.com; user@spark.apache.org
Subject: RE: Cluster mode deployment from jar in S3
Date: Mon, 4 Jul 2016 09:50:50 +

Hi,
The aws CLI already has your access key id and secret access key from when you
initially configured it.
Is your S3 bucket without any access restrictions?

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3

Sorry to do this...but... *bump*


From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100

Hello,

I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs
using "--deploy-mode client", however using "--deploy-mode cluster" is proving
to be a challenge. I've tried this:

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
s3://bucket/dir/foo.jar

When I do this, I get:

16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)

Now I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So, I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:

aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar

I'm using Spark 1.6.2, and can't really think of what I can do so that I can
submit the jar from s3 using cluster deploy mode. I've also tried simply
downloading the jar onto a node, and spark-submitting that... that works in
client mode, but I get a not-found error when using cluster mode.

Any help will be appreciated.

Thanks,
Ashic.

Information transmitted by this e-mail is proprietary to Mphasis, its
associated companies and/ or its customers and is intended for use only by the
individual or entity to which it is addressed, and may contain information
that is privileged, confidential or exempt from disclosure under applicable
law. If you are not the intended recipient or it appears that this mail has
been forwarded to you without proper authority, you are notified that any use
or dissemination of this information in any manner is strictly prohibited. In
such cases, please notify us immediately at mailmas...@mphasis.com and delete
this mail from your records.

Re: java.io.FileNotFoundException

2016-07-04 Thread Jacek Laskowski
Hi,

You seem to be using yarn. Is this cluster or client deploy mode? Have
you seen any other exceptions before? How long did the application run
before the exception?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Jul 4, 2016 at 10:57 AM, kishore kumar  wrote:
> We've upgraded spark version from 1.2 to 1.6 still the same problem,
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 286 in stage
> 2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage 2397.0
> (TID 314416, salve-06.domain.com): java.io.FileNotFoundException:
> /opt/mapr/tmp/h
> adoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.inde
> x.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)
>
> Kindly help me to get rid of this.
>
> On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar  wrote:
>>
>> Hi,
>>
>> Could anyone help me about this error ? why this error comes ?
>>
>> Thanks,
>> KishoreKuamr.
>>
>> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
>> wrote:
>>>
>>> Hi Jeff Zhang,
>>>
>>> Thanks for response, could you explain me why this error occurs ?
>>>
>>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:

 One quick solution is to use spark 1.6.1.

 On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar 
 wrote:
>
> Could anyone help me on this issue ?
>
> On Tue, May 31, 2016 at 8:00 PM, kishore kumar 
> wrote:
>>
>> Hi,
>>
>> We installed spark1.2.1 in single node, running a job in yarn-client
>> mode on yarn which loads data into hbase and elasticsearch,
>>
>> the error which we are encountering is
>> Exception in thread "main" org.apache.spark.SparkException: Job
>> aborted due to stage failure: Task 38 in stage 26800.0 failed 4 times, 
>> most
>> recent failure: Lost task 38.3 in stage 26800.0 (TID 4990082,
>> hdprd-c01-r04-03): java.io.FileNotFoundException:
>> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
>> (No such file or directory)
>>
>> any idea about this error ?
>> --
>> Thanks,
>> Kishore.
>
>
>
>
> --
> Thanks,
> Kishore.




 --
 Best Regards

 Jeff Zhang
>>>
>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Kishore.
>>
>>
>>
>>
>> --
>> Thanks,
>> Kishore.
>
>
>
>
> --
> Thanks,
> Kishore.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Lohith Samaga M
Hi,
The aws CLI already has your access key id and secret access
key from when you initially configured it.
Is your S3 bucket without any access restrictions?


Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga


From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3

Sorry to do this...but... *bump*


From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100
Hello,
I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs 
using "--deploy-mode client", however using "--deploy-mode cluster" is proving 
to be a challenge. I've tried this:

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
s3://bucket/dir/foo.jar

When I do this, I get:
16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was: 
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key 
must be specified as the username or password (respectively) of a s3 URL, or by 
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties 
(respectively).
at 
org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at 
org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)


Now I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So, I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:

aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar

I'm using Spark 1.6.2, and can't really think of what I can do so that I can 
submit the jar from s3 using cluster deploy mode. I've also tried simply 
downloading the jar onto a node, and spark-submitting that... that works in 
client mode, but I get a not found error when using cluster mode.

Any help will be appreciated.

Thanks,
Ashic.


RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Sorry to do this...but... *bump*
From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100




Hello,

I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs
using "--deploy-mode client", however using "--deploy-mode cluster" is proving
to be a challenge. I've tried this:

spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
s3://bucket/dir/foo.jar

When I do this, I get:

16/07/01 16:23:16 ERROR ClientEndpoint: Exception from cluster was:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key
must be specified as the username or password (respectively) of a s3 URL, or by
setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
(respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)

Now I'm not using any S3 or Hadoop stuff within my code (it's just an
sc.parallelize(1 to 100)). So, I imagine it's the driver trying to fetch the
jar. I haven't set the AWS Access Key Id and Secret as mentioned, but the role
the machines are in allows them to copy the jar. In other words, this works:

aws s3 cp s3://bucket/dir/foo.jar /tmp/foo.jar

I'm using Spark 1.6.2, and can't really think of what I can do so that I can
submit the jar from s3 using cluster deploy mode. I've also tried simply
downloading the jar onto a node, and spark-submitting that... that works in
client mode, but I get a not-found error when using cluster mode.

Any help will be appreciated.

Thanks,
Ashic.

Re: Graphframe Error

2016-07-04 Thread Felix Cheung
It looks like either the extracted Python code is corrupted or there is a
Python version mismatch. Are you using Python 3?


stackoverflow.com/questions/514371/whats-the-bad-magic-number-error





On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
> wrote:

Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically load the required graphframes jar file from the Maven
repository; it is not affected by where the jar file was placed. Your example
works well on my laptop.

Or you can try:


bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set "--py-files" and 
"--jars" options with the directory where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel 
>:
I started my pyspark shell with command  (I am using spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied 
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
 to the lib directory of Spark as well.

I was getting below error

>>> from graphframes import *
Traceback (most recent call last):
  File "", line 1, in 
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I have extracted the graphframes 
python directory and copied to the local directory where I am running pyspark.

>>> from graphframes import *

But, not able to create the GraphFrame

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'GraphFrame' is not defined

Also, I am getting below error.
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "", line 1, in 
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun
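Felix's "bad magic number" diagnosis below can be checked mechanically: a .pyc is only loadable by the interpreter series that compiled it, and the magic number sits in the first four bytes of the file. A minimal sketch, not from the thread (plain Python, no graphframes needed; the path you would pass is graphframes/examples.pyc from the extracted package):

```python
import importlib.util

def pyc_matches_interpreter(pyc_path):
    """Return True if the .pyc's magic number matches the running interpreter's."""
    with open(pyc_path, "rb") as f:
        # The first 4 bytes of every .pyc are the compiling interpreter's magic number.
        return f.read(4) == importlib.util.MAGIC_NUMBER
```

If this returns False for graphframes/examples.pyc, the zip was built for a different Python version and the ImportError above is expected.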



Re: java.io.FileNotFoundException

2016-07-04 Thread karthi keyan
kishore,

Could you please post the application master logs? That will give us more
details.

Best,
Karthik

On Mon, Jul 4, 2016 at 2:27 PM, kishore kumar  wrote:

> We've upgraded spark version from 1.2 to 1.6 still the same problem,
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 286 in stage
> 2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage
> 2397.0 (TID 314416, salve-06.domain.com): java.io.FileNotFoundException:
> /opt/mapr/tmp/h
>
> adoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.inde
> x.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)
>
> Kindly help me to get rid of this.
>
> On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar 
> wrote:
>
>> Hi,
>>
>> Could anyone help me about this error ? why this error comes ?
>>
>> Thanks,
>> KishoreKuamr.
>>
>> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
>> wrote:
>>
>>> Hi Jeff Zhang,
>>>
>>> Thanks for response, could you explain me why this error occurs ?
>>>
>>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:
>>>
 One quick solution is to use spark 1.6.1.

 On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar 
 wrote:

> Could anyone help me on this issue ?
>
> On Tue, May 31, 2016 at 8:00 PM, kishore kumar 
> wrote:
>
>> Hi,
>>
>> We installed spark1.2.1 in single node, running a job in yarn-client
>> mode on yarn which loads data into hbase and elasticsearch,
>>
>> the error which we are encountering is
>> Exception in thread "main" org.apache.spark.SparkException: Job
>> aborted due to stage failure: Task 38 in stage 26800.0 failed 4 times, 
>> most
>> recent failure: Lost task 38.3 in stage 26800.0 (TID 4990082,
>> hdprd-c01-r04-03): java.io.FileNotFoundException:
>> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
>> (No such file or directory)
>>
>> any idea about this error ?
>> --
>> Thanks,
>> Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>>
>>> --
>>> Thanks,
>>> Kishore.
>>>
>>
>>
>>
>> --
>> Thanks,
>> Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>
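For completeness: on YARN, the application master logs requested above can usually be pulled with the yarn CLI once log aggregation has run (a sketch; the application id is the one from the stack trace in this thread):

```shell
# Fetch the aggregated container logs, which include the application
# master's stderr/stdout, for the failing application.
yarn logs -applicationId application_1467474162580_29353 > app.log
```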


Re: java.io.FileNotFoundException

2016-07-04 Thread kishore kumar
We've upgraded the Spark version from 1.2 to 1.6 and still have the same problem:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 286 in stage
2397.0 failed 4 times, most recent failure: Lost task 286.3 in stage 2397.0
(TID 314416, salve-06.domain.com): java.io.FileNotFoundException:
/opt/mapr/tmp/h
adoop-tmp/hadoop-mapr/nm-local-dir/usercache/user1/appcache/application_1467474162580_29353/blockmgr-bd075392-19c2-4cb8-8033-0fe54d683c8f/12/shuffle_530_286_0.inde
x.c374502a-4cf2-4052-abcf-42977f1623d0 (No such file or directory)

Kindly help me to get rid of this.

On Sun, Jun 5, 2016 at 9:43 AM, kishore kumar  wrote:

> Hi,
>
> Could anyone help me about this error ? why this error comes ?
>
> Thanks,
> KishoreKuamr.
>
> On Fri, Jun 3, 2016 at 9:12 PM, kishore kumar 
> wrote:
>
>> Hi Jeff Zhang,
>>
>> Thanks for response, could you explain me why this error occurs ?
>>
>> On Fri, Jun 3, 2016 at 6:15 PM, Jeff Zhang  wrote:
>>
>>> One quick solution is to use spark 1.6.1.
>>>
>>> On Fri, Jun 3, 2016 at 8:35 PM, kishore kumar 
>>> wrote:
>>>
 Could anyone help me on this issue ?

 On Tue, May 31, 2016 at 8:00 PM, kishore kumar 
 wrote:

> Hi,
>
> We installed spark1.2.1 in single node, running a job in yarn-client
> mode on yarn which loads data into hbase and elasticsearch,
>
> the error which we are encountering is
> Exception in thread "main" org.apache.spark.SparkException: Job
> aborted due to stage failure: Task 38 in stage 26800.0 failed 4 times, 
> most
> recent failure: Lost task 38.3 in stage 26800.0 (TID 4990082,
> hdprd-c01-r04-03): java.io.FileNotFoundException:
> /opt/mapr/tmp/hadoop-tmp/hadoop-mapr/nm-local-dir/usercache/sparkuser/appcache/application_1463194314221_211370/spark-3cc37dc7-fa3c-4b98-aa60-0acdfc79c725/28/shuffle_8553_38_0.index
> (No such file or directory)
>
> any idea about this error ?
> --
> Thanks,
> Kishore.
>



 --
 Thanks,
 Kishore.

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Thanks,
>> Kishore.
>>
>
>
>
> --
> Thanks,
> Kishore.
>



-- 
Thanks,
Kishore.


Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically load the required graphframes jar file from the Maven
repository; it is not affected by where the jar file was placed. Your example
works well on my laptop.

Or you can try:

bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set "--py-files" and
"--jars" options with the directory where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel :

> I started my pyspark shell with command  (I am using spark 1.6).
>
> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>
> I have copied
> http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
> to the lib directory of Spark as well.
>
> I was getting below error
>
> >>> from graphframes import *
> Traceback (most recent call last):
>   File "", line 1, in 
> zipimport.ZipImportError: can't find module 'graphframes'
> >>>
>
> So, as per suggestions from similar questions, I have extracted the
> graphframes python directory and copied to the local directory where I am
> running pyspark.
>
> >>> from graphframes import *
>
> But, not able to create the GraphFrame
>
> >>> g = GraphFrame(v, e)
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'GraphFrame' is not defined
>
> Also, I am getting below error.
> >>> from graphframes.examples import Graphs
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: Bad magic number in graphframes/examples.pyc
>
> Any help will be highly appreciated.
>
> - Arun
>


Limiting Pyspark.daemons

2016-07-04 Thread ar7
Hi,

I am currently using PySpark 1.6.1 in my cluster. When a PySpark application
is run, the load on the workers seems to exceed what was allocated. When I
ran top, I noticed that there were too many pyspark.daemon processes
running. There was another mail thread regarding the same:

https://mail-archives.apache.org/mod_mbox/spark-user/201606.mbox/%3ccao429hvi3drc-ojemue3x4q1vdzt61htbyeacagtre9yrhs...@mail.gmail.com%3E

I followed what was mentioned there, i.e. reduced the number of executor
cores and the number of executors on one node to 1. But the number of
pyspark.daemon processes is still not coming down. It looks like initially
there is one pyspark.daemon process, and this in turn spawns as many
pyspark.daemon workers as the number of cores in the machine.

Any help is appreciated :)

Thanks,
Ashwin Raaghav.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Limiting-Pyspark-daemons-tp27272.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

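As far as I know, Spark 1.6 has no direct cap on pyspark.daemon children: the daemon forks one Python worker per concurrently running task, so the practical knob is the executor core count. A hedged sketch of the submit-time settings the quoted thread suggests (the option names are real Spark 1.6 settings; my_app.py is a placeholder, and whether this fully bounds the process count on a given cluster needs verifying):

```shell
# One executor per node with one core => at most one concurrent task,
# hence (roughly) one forked pyspark.daemon worker per executor.
# spark.python.worker.reuse keeps workers alive across tasks instead of
# re-forking for each one.
spark-submit --master yarn \
  --num-executors 1 \
  --executor-cores 1 \
  --conf spark.python.worker.reuse=true \
  my_app.py
```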



ORC or parquet with Spark

2016-07-04 Thread Ashok Kumar
With Spark caching, which file format is best to use: Parquet or ORC?
Obviously ORC can be used with Hive.
My question is whether Spark can use the various file-, stripe-, and
row-group-level statistics stored in an ORC file.
Otherwise, to me both Parquet and ORC are simply files kept on HDFS. They do
not offer any caching to be faster.
So if Spark ignores the underlying stats for ORC files, does it matter which
file format to use with Spark?
Thanks
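On the statistics question: Spark 1.6 can push filter predicates down into the ORC (and Parquet) readers so the embedded statistics are used to skip data, but for ORC this is off by default behind a flag. A sketch (the conf names are real Spark SQL settings; my_app.py is a placeholder):

```shell
# Enable ORC predicate pushdown so stripe/row-group statistics can be
# used to skip data; Parquet's equivalent (spark.sql.parquet.filterPushdown)
# defaults to true in Spark 1.6.
spark-submit \
  --conf spark.sql.orc.filterPushdown=true \
  my_app.py
```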