NoClassDefFound exception after setting spark.eventLog.enabled=true

2016-09-02 Thread C. Josephson
I use Spark 1.6.2 with Java, and after I set spark.eventLog.enabled=true,
Spark crashes with this exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/json4s/jackson/JsonMethods$
        at org.apache.spark.scheduler.EventLoggingListener$.initEventLog(EventLoggingListener.scala:257)
        at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:124)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:519)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at com.uhana.stream.UhanaStreamingContext.<init>(UhanaStreamingContext.java:165)
        at com.uhana.stream.UhanaStreamingContext.<init>(UhanaStreamingContext.java:23)
        at com.uhana.stream.UhanaStreamingContext$Builder.build(UhanaStreamingContext.java:159)
        at com.uhana.stream.app.StreamingMain.main(StreamingMain.java:41)
Caused by: java.lang.ClassNotFoundException: org.json4s.jackson.JsonMethods$
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 8 more

I downloaded some jars for json4s, but I'm not sure where they should go.

Thanks,
-cjosephson


Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Mich Talebzadeh
Hi,

You can create Hive external tables on top of an existing HBase table using
the storage handler clause

STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

Example

hive> show create table hbase_table;
OK
CREATE TABLE `hbase_table`(
  `key` int COMMENT '',
  `value1` string COMMENT '',
  `value2` int COMMENT '',
  `value3` int COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,a:b,a:c,d:e',
  'serialization.format'='1')
TBLPROPERTIES (
  'transient_lastDdlTime'='1472370939')

Then I try to access this Hive table from Spark, which is giving me grief at
the moment :(

scala> HiveContext.sql("use test")
res9: org.apache.spark.sql.DataFrame = []
scala> val hbase_table= spark.table("hbase_table")
16/09/02 23:31:07 ERROR log: error in initSerDe:
java.lang.ClassNotFoundException Class
org.apache.hadoop.hive.hbase.HBaseSerDe not found

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 23:08, KhajaAsmath Mohammed 
wrote:

> Hi Kim,
>
> I am also looking for same information. Just got the same requirement
> today.
>
> Thanks,
> Asmath
>
> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  wrote:
>
>> I was wondering if anyone has tried to create Spark SQL tables on top of
>> HBase tables so that data in HBase can be accessed using Spark Thriftserver
>> with SQL statements? This is similar what can be done using Hive.
>>
>> Thanks,
>> Ben
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread ayan guha
You can either read HBase into an RDD and then turn it into a DataFrame,
expose the HBase tables through Hive and read them from Hive, or use Phoenix.
A rough sketch of the first option is below.
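
For the first option, a rough sketch from the spark-shell (assuming the HBase
client jars are on the classpath, that `sc` and `spark` are available, and
that the table name "hbase_table", column family "a" and qualifier "b" are
placeholders for your own):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import spark.implicits._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "hbase_table")  // table name is an assumption

// Read the HBase table as an RDD of (row key, Result) pairs
val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Extract the row key and one example column ("a:b"), then turn it into a DataFrame
// (this assumes the row key was written as a 4-byte int)
val df = hbaseRdd.map { case (_, result) =>
  (Bytes.toInt(result.getRow),
   Bytes.toString(result.getValue(Bytes.toBytes("a"), Bytes.toBytes("b"))))
}.toDF("key", "value1")
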
On 3 Sep 2016 08:08, "KhajaAsmath Mohammed"  wrote:

> Hi Kim,
>
> I am also looking for same information. Just got the same requirement
> today.
>
> Thanks,
> Asmath
>
> On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  wrote:
>
>> I was wondering if anyone has tried to create Spark SQL tables on top of
>> HBase tables so that data in HBase can be accessed using Spark Thriftserver
>> with SQL statements? This is similar what can be done using Hive.
>>
>> Thanks,
>> Ben
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: Spark SQL Tables on top of HBase Tables

2016-09-02 Thread KhajaAsmath Mohammed
Hi Kim,

I am also looking for same information. Just got the same requirement today.

Thanks,
Asmath

On Fri, Sep 2, 2016 at 4:46 PM, Benjamin Kim  wrote:

> I was wondering if anyone has tried to create Spark SQL tables on top of
> HBase tables so that data in HBase can be accessed using Spark Thriftserver
> with SQL statements? This is similar what can be done using Hive.
>
> Thanks,
> Ben
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark SQL Tables on top of HBase Tables

2016-09-02 Thread Benjamin Kim
I was wondering if anyone has tried to create Spark SQL tables on top of HBase
tables so that data in HBase can be accessed using the Spark Thrift Server with SQL
statements? This is similar to what can be done using Hive.

Thanks,
Ben


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread sagarcasual .
Hi Cody, thanks for the reply.
I am using Spark 1.6.1 with Kafka 0.9.
When I want to stop streaming, stopping the context sounds OK. But for
temporarily excluding partitions, is there any way I can supply
topic-partition info dynamically at the beginning of every batch?
Would a StreamingListener be of any help?

On Fri, Sep 2, 2016 at 10:37 AM, Cody Koeninger  wrote:

> If you just want to pause the whole stream, just stop the app and then
> restart it when you're ready.
>
> If you want to do some type of per-partition manipulation, you're
> going to need to write some code.  The 0.10 integration makes the
> underlying kafka consumer pluggable, so you may be able to wrap a
> consumer to do what you need.
>
> On Fri, Sep 2, 2016 at 12:28 PM, sagarcasual . 
> wrote:
> > Hello, this is for
> > Pausing spark kafka streaming (direct) or exclude/include some
> partitions on
> > the fly per batch
> > =
> > I have following code that creates a direct stream using Kafka connector
> for
> > Spark.
> >
> > final JavaInputDStream msgRecords =
> > KafkaUtils.createDirectStream(
> > jssc, String.class, String.class, StringDecoder.class,
> > StringDecoder.class,
> > KafkaMessage.class, kafkaParams, topicsPartitions,
> > message -> {
> > return KafkaMessage.builder()
> > .
> > .build();
> > }
> > );
> >
> > However I want to handle a situation, where I can decide that this
> streaming
> > needs to pause for a while on conditional basis, is there any way to
> achieve
> > this? Say my Kafka is undergoing some maintenance, so between 10AM to
> 12PM
> > stop processing, and then again pick up at 12PM from the last offset,
> how do
> > I do it?
> >
> > Also, assume all of a sudden we want to take one-or-more of the
> partitions
> > for a pull and add it back after some pulls, how do I achieve that?
> >
> > -Regards
> > Sagar
> >
>


Re: Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Mich Talebzadeh
Since you are using the Spark Thrift Server (which in turn uses the Hive Thrift
Server), I suspect it uses the Hive optimiser, which would mean that stats do
matter. However, that may just be an assumption.

Have you partitioned these parquet tables?

Is it worth logging in to Hive and running the same queries there with EXPLAIN
EXTENDED SELECT ...? Can you see whether the relevant partition is picked
up?

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 12:03, Сергей Романов  wrote:

> Hi, Mich,
>
> Column x29 does not seems to be any special. It's a newly created table
> and I did not calculate stats for any columns. Actually, I can sum a single
> column several times in query and face some landshift performance hit at
> some "magic" point. Setting "set spark.sql.codegen.wholeStage=false"
> makes all requests run in a similar slow time (which is slower original
> time).
>
> PS. Does Stats even helps for Spark queries?
>
> SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY
> field
>
> 0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS
> `advertiser_id`,  SUM(`dd_convs`) AS `dd_convs`  FROM
> `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY
> `advertiser_id`  LIMIT 30;
> 30 rows selected (1.37 seconds)
> 30 rows selected (1.382 seconds)
> 30 rows selected (1.399 seconds)
>
> SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY
> field
> 0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS
> `advertiser_id`,  SUM(`actual_dsp_fee`) AS `actual_dsp_fee`  FROM
> `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY
> `advertiser_id`  LIMIT 30;
> 30 rows selected (1.379 seconds)
> 30 rows selected (1.382 seconds)
> 30 rows selected (1.377 seconds)
>
> SELECT field, SUM(x28) x repeat 40 times FROM parquet_table WHERE
> partition = 1 GROUP BY field -> 1.774s
>
> 0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS
> `advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS
> `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM
> `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY
> `advertiser_id`  LIMIT 30;
> 30 rows selected (1.774 seconds)
> 30 rows selected (1.721 seconds)
> 30 rows selected (2.813 seconds)
>
> SELECT field, SUM(x28) x repeat 41 times FROM parquet_table WHERE
> partition = 1 GROUP BY field -> 7.314s
>
> 0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS
> `advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS
> `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`)  FROM `slicer`.`573_slicer_rnd_13`  WHERE dt =
> '2016-07-28'  GROUP BY `advertiser_id`  LIMIT 30;
> 30 rows selected (7.314 seconds)
> 30 rows selected (7.27 seconds)
> 30 rows selected (7.279 seconds)
>
> SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1
> -> 1.378s
>
> 0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS
> `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`),
> SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 

Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Mich Talebzadeh
Hi,

As I understand it, Spark's memory allocation is split between execution memory
and storage memory, and in the simplest configuration their sum is fixed. So by
using the storage cache you draw from that shared budget.

Now:

   1. cache() is an alias for persist(MEMORY_ONLY).
   2. Caching is only done once.
   3. Both DataFrames and RDDs can be cached.

If you cache an RDD or DataFrame it will persist in memory until it is evicted,
as Spark uses an LRU (Least Recently Used) policy. So if your RDD is moderately
small and is accessed iteratively, caching it would be advantageous for faster
access. Otherwise, leave it as it is. The Spark documentation explains this.

You can perform some tests by running both approaches and checking the Spark UI
(default port 4040) under the Storage tab to see the amount of data cached.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 21:21, Davies Liu  wrote:

> Caching a RDD/DataFrame always has some cost, in this case, I'd suggest
> that
> do not cache the DataFrame, the first() is usually fast enough (only
> compute the
> partitions as needed).
>
> On Fri, Sep 2, 2016 at 1:05 PM, apu  wrote:
> > When I first learnt Spark, I was told that cache() is desirable anytime
> one
> > performs more than one Action on an RDD or DataFrame. For example,
> consider
> > the PySpark toy example below; it shows two approaches to doing the same
> > thing.
> >
> > # Approach 1 (bad?)
> > df2 = someTransformation(df1)
> > a = df2.count()
> > b = df2.first() # This step could take long, because df2 has to be
> created
> > all over again
> >
> > # Approach 2 (good?)
> > df2 = someTransformation(df1)
> > df2.cache()
> > a = df2.count()
> > b = df2.first() # Because df2 is already cached, this action is quick
> > df2.unpersist()
> >
> > The second approach shown above is somewhat clunky, because it requires
> one
> > to cache any dataframe that will be Acted on more than once, followed by
> the
> > need to call unpersist() later to free up memory.
> >
> > So my question is: is the second approach still necessary/desirable when
> > operating on DataFrames in newer versions of Spark (>=1.6)?
> >
> > Thanks!!
> >
> > Apu
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


how to pass trustStore path into pyspark ?

2016-09-02 Thread Eric Ho
I'm trying to pass a trustStore pathname into pyspark.
What env variable and/or config file or script I need to change to do this ?
I've tried setting JAVA_OPTS env var but to no avail...
any pointer much appreciated...  thx

-- 

-eric ho


Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Davies Liu
Caching an RDD/DataFrame always has some cost. In this case, I'd suggest not
caching the DataFrame; first() is usually fast enough (it only computes the
partitions it needs).

On Fri, Sep 2, 2016 at 1:05 PM, apu  wrote:
> When I first learnt Spark, I was told that cache() is desirable anytime one
> performs more than one Action on an RDD or DataFrame. For example, consider
> the PySpark toy example below; it shows two approaches to doing the same
> thing.
>
> # Approach 1 (bad?)
> df2 = someTransformation(df1)
> a = df2.count()
> b = df2.first() # This step could take long, because df2 has to be created
> all over again
>
> # Approach 2 (good?)
> df2 = someTransformation(df1)
> df2.cache()
> a = df2.count()
> b = df2.first() # Because df2 is already cached, this action is quick
> df2.unpersist()
>
> The second approach shown above is somewhat clunky, because it requires one
> to cache any dataframe that will be Acted on more than once, followed by the
> need to call unpersist() later to free up memory.
>
> So my question is: is the second approach still necessary/desirable when
> operating on DataFrames in newer versions of Spark (>=1.6)?
>
> Thanks!!
>
> Apu

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread apu
When I first learnt Spark, I was told that *cache()* is desirable anytime
one performs more than one Action on an RDD or DataFrame. For example,
consider the PySpark toy example below; it shows two approaches to doing
the same thing.

# Approach 1 (bad?)
df2 = someTransformation(df1)
a = df2.count()
b = df2.first() # This step could take long, because df2 has to be created
all over again

# Approach 2 (good?)
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first() # Because df2 is already cached, this action is quick
df2.unpersist()

The second approach shown above is somewhat clunky, because it requires one
to cache any dataframe that will be Acted on more than once, followed by
the need to call *unpersist()* later to free up memory.

*So my question is: is the second approach still necessary/desirable when
operating on DataFrames in newer versions of Spark (>=1.6)?*

Thanks!!

Apu


Reset auto.offset.reset in Kafka 0.10 integ

2016-09-02 Thread Srikanth
Hi,

Upon restarting my Spark Streaming app, it fails with this error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 1.0 (TID 6, localhost):
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of
range with no configured reset policy for partitions: {mt-event-2=1710706}

It is correct that the last read offset was deleted by Kafka due to
retention-period expiry.
I've set auto.offset.reset in my app, but it is getting reset here:

https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaUtils.scala#L160

How to force it to restart in this case (fully aware of potential data
loss)?
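
One possible workaround (a sketch only, assuming the 0-10 integration and that
you can look up valid starting offsets yourself after the retention purge, e.g.
by querying Kafka): pass explicit starting offsets through the ConsumerStrategy
so the enforced reset policy never has to fire. The topic name, partition and
offset below are placeholders.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// Hypothetical offsets obtained out of band after the old ones were purged
val fromOffsets = Map(new TopicPartition("mt-event", 2) -> 1710800L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("mt-event"), kafkaParams, fromOffsets))
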

Srikanth


Re: Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Jakob Odersky
Spark currently requires at least Java 1.7, so adding a Java
1.8-specific encoder will not be straightforward without affecting
requirements. I can think of two solutions:

1. add a Java 1.8 build profile which includes such encoders (this may
be useful for Scala 2.12 support in the future as well)
2. expose a custom Encoder API (the current one is not easily extensible)

I would personally favor solution number 2 as it avoids adding yet
another build configuration to choose from, however I am not sure how
feasible it is to make custom encoders play nice with Catalyst.

To get back to your question, I don't think there are currently any
plans, and I would recommend working around the issue by converting to
the old Date API:
http://stackoverflow.com/questions/33066904/localdate-to-java-util-date-and-vice-versa-simpliest-conversion
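
For example, a small sketch of converting at the boundary (both conversions are
plain Java 8 APIs; the case class is just an illustration):

import java.time.LocalDate

// Use java.sql.Date in the case class, since Spark already provides an encoder for it
case class Event(id: Long, date: java.sql.Date)

// Convert a LocalDate to java.sql.Date before building the Dataset...
val d: java.sql.Date = java.sql.Date.valueOf(LocalDate.of(2016, 9, 2))

// ...and back to a LocalDate after reading
val back: LocalDate = d.toLocalDate
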

On Fri, Sep 2, 2016 at 8:29 AM, Daniel Siegmann
 wrote:
> It seems Spark can handle case classes with java.sql.Date, but not
> java.time.LocalDate. It complains there's no encoder.
>
> Are there any plans to add an encoder for LocalDate (and other classes in
> the new Java 8 Time and Date API), or is there an existing library I can use
> that provides encoders?
>
> --
> Daniel Siegmann
> Senior Software Engineer
> SecurityScorecard Inc.
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
And, no, Spark's scheduler will not preempt already running Tasks.  In
fact, just killing running Tasks for any reason is trickier than we'd like
it to be, so it isn't done by default:
https://issues.apache.org/jira/browse/SPARK-17064

On Fri, Sep 2, 2016 at 11:34 AM, Mark Hamstra 
wrote:

> The comparator is used in `Pool#getSortedTaskSetQueue`.  The
> `TaskSchedulerImpl` calls that on the rootPool when the TaskScheduler needs
> to handle `resourceOffers` for available Executor cores.  Creation of the
> `sortedTaskSets` is a recursive, nested sorting of the `Schedulable`
> entities -- you can have pools within pools within pools within... if you
> really want to, but they eventually bottom out in TaskSetManagers.  The
> `sortedTaskSets` is a flattened queue of the TaskSets, and the available
> cores are offered to those TaskSets in that queued order until the next
> time the scheduler backend handles the available resource offers and a new
> `sortedTaskSets` is generated.
>
> On Fri, Sep 2, 2016 at 2:37 AM, enrico d'urso  wrote:
>
>> Thank you.
>>
>> May I know when that comparator is called?
>> It looks like spark scheduler has not any form of preemption, am I right?
>>
>> Thank you
>> --
>> *From:* Mark Hamstra 
>> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>>
>> *To:* enrico d'urso
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark scheduling mode
>>
>> Spark's FairSchedulingAlgorithm is not round robin:
>> https://github.com/apache/spark/blob/master/core/src/
>> main/scala/org/apache/spark/scheduler/SchedulingAlgorithm.scala#L43
>>
>> When at the scope of fair scheduling Jobs within a single Pool, the
>> Schedulable entities being handled (s1 and s2) are TaskSetManagers, which
>> are at the granularity of Stages, not Jobs.  Since weight is 1 and minShare
>> is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers
>> just boils down to prioritizing TaskSets (i.e. Stages) with the fewest
>> number of runningTasks.
>>
>> On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso  wrote:
>>
>>> I tried it before, but still I am not able to see a proper round robin
>>> across the jobs I submit.
>>> Given this:
>>>
>>> <pool name="production">
>>>     <schedulingMode>FAIR</schedulingMode>
>>>     <weight>1</weight>
>>>     <minShare>2</minShare>
>>>   </pool>
>>>
>>> Each jobs inside production pool should be scheduled in round robin way,
>>> am I right?
>>>
>>> --
>>> *From:* Mark Hamstra 
>>> *Sent:* Thursday, September 1, 2016 8:19:44 PM
>>> *To:* enrico d'urso
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Spark scheduling mode
>>>
>>> The default pool ("default") can be configured like any
>>> other pool: https://spark.apache.org/docs/latest/job-scheduling.ht
>>> ml#configuring-pool-properties
>>>
>>> On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso  wrote:
>>>
 Is there a way to force scheduling to be fair *inside* the default
 pool?
 I mean, round robin for the jobs that belong to the default pool.

 Cheers,
 --
 *From:* Mark Hamstra 
 *Sent:* Thursday, September 1, 2016 7:24:54 PM
 *To:* enrico d'urso
 *Cc:* user@spark.apache.org
 *Subject:* Re: Spark scheduling mode

 Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
 mean that Spark can magically configure and start multiple scheduling pools
 for you, nor can it know to which pools you want jobs assigned.  Without
 doing any setup of additional scheduling pools or assigning of jobs to
 pools, you're just dumping all of your jobs into the one available default
 pool (which is now being fair scheduled with an empty set of other pools)
 and the scheduling of jobs within that pool is still the default intra-pool
 scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
 flipping spark.scheduler.mode to FAIR.

 On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso  wrote:

> I am building a Spark App, in which I submit several jobs (pyspark). I
> am using threads to run them in parallel, and also I am setting:
> conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
> serially in FIFO way. Am I missing something?
>
> Cheers,
>
>
> Enrico
>


>>>
>>
>


Re: Spark scheduling mode

2016-09-02 Thread Mark Hamstra
The comparator is used in `Pool#getSortedTaskSetQueue`.  The
`TaskSchedulerImpl` calls that on the rootPool when the TaskScheduler needs
to handle `resourceOffers` for available Executor cores.  Creation of the
`sortedTaskSets` is a recursive, nested sorting of the `Schedulable`
entities -- you can have pools within pools within pools within... if you
really want to, but they eventually bottom out in TaskSetManagers.  The
`sortedTaskSets` is a flattened queue of the TaskSets, and the available
cores are offered to those TaskSets in that queued order until the next
time the scheduler backend handles the available resource offers and a new
`sortedTaskSets` is generated.
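
On the earlier point about assigning jobs to pools: a minimal sketch (assuming a
pool named "production" is defined in your fairscheduler.xml and
spark.scheduler.allocation.file points at that file) is to set the pool as a
thread-local property before submitting jobs from that thread:

// Jobs submitted from this thread go to the "production" pool
sc.setLocalProperty("spark.scheduler.pool", "production")
try {
  sc.parallelize(1 to 1000).count()
} finally {
  // Clear the property so later jobs from this thread fall back to the default pool
  sc.setLocalProperty("spark.scheduler.pool", null)
}
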

On Fri, Sep 2, 2016 at 2:37 AM, enrico d'urso  wrote:

> Thank you.
>
> May I know when that comparator is called?
> It looks like spark scheduler has not any form of preemption, am I right?
>
> Thank you
> --
> *From:* Mark Hamstra 
> *Sent:* Thursday, September 1, 2016 8:44:10 PM
>
> *To:* enrico d'urso
> *Cc:* user@spark.apache.org
> *Subject:* Re: Spark scheduling mode
>
> Spark's FairSchedulingAlgorithm is not round robin: https://github.com/
> apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/
> SchedulingAlgorithm.scala#L43
>
> When at the scope of fair scheduling Jobs within a single Pool, the
> Schedulable entities being handled (s1 and s2) are TaskSetManagers, which
> are at the granularity of Stages, not Jobs.  Since weight is 1 and minShare
> is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers
> just boils down to prioritizing TaskSets (i.e. Stages) with the fewest
> number of runningTasks.
>
> On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso  wrote:
>
>> I tried it before, but still I am not able to see a proper round robin
>> across the jobs I submit.
>> Given this:
>>
>> <pool name="production">
>>     <schedulingMode>FAIR</schedulingMode>
>>     <weight>1</weight>
>>     <minShare>2</minShare>
>>   </pool>
>>
>> Each jobs inside production pool should be scheduled in round robin way,
>> am I right?
>>
>> --
>> *From:* Mark Hamstra 
>> *Sent:* Thursday, September 1, 2016 8:19:44 PM
>> *To:* enrico d'urso
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Spark scheduling mode
>>
>> The default pool ("default") can be configured like any
>> other pool: https://spark.apache.org/docs/latest/job-scheduling.
>> html#configuring-pool-properties
>>
>> On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso  wrote:
>>
>>> Is there a way to force scheduling to be fair *inside* the default pool?
>>> I mean, round robin for the jobs that belong to the default pool.
>>>
>>> Cheers,
>>> --
>>> *From:* Mark Hamstra 
>>> *Sent:* Thursday, September 1, 2016 7:24:54 PM
>>> *To:* enrico d'urso
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Spark scheduling mode
>>>
>>> Just because you've flipped spark.scheduler.mode to FAIR, that doesn't
>>> mean that Spark can magically configure and start multiple scheduling pools
>>> for you, nor can it know to which pools you want jobs assigned.  Without
>>> doing any setup of additional scheduling pools or assigning of jobs to
>>> pools, you're just dumping all of your jobs into the one available default
>>> pool (which is now being fair scheduled with an empty set of other pools)
>>> and the scheduling of jobs within that pool is still the default intra-pool
>>> scheduling, FIFO -- i.e., you've effectively accomplished nothing by only
>>> flipping spark.scheduler.mode to FAIR.
>>>
>>> On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso  wrote:
>>>
 I am building a Spark App, in which I submit several jobs (pyspark). I
 am using threads to run them in parallel, and also I am setting:
 conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run
 serially in FIFO way. Am I missing something?

 Cheers,


 Enrico

>>>
>>>
>>
>


Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread Cody Koeninger
If you just want to pause the whole stream, just stop the app and then
restart it when you're ready.

If you want to do some type of per-partition manipulation, you're
going to need to write some code.  The 0.10 integration makes the
underlying kafka consumer pluggable, so you may be able to wrap a
consumer to do what you need.
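
For the stop/restart route, a minimal sketch (assuming `ssc` is the running
StreamingContext and the pause condition is something your own code checks):

// Stop the streaming app gracefully when the maintenance window starts; on the next
// launch the stream resumes from the offsets you track yourself (or from the checkpoint).
if (maintenanceWindowActive) {   // placeholder condition
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}
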

On Fri, Sep 2, 2016 at 12:28 PM, sagarcasual .  wrote:
> Hello, this is for
> Pausing spark kafka streaming (direct) or exclude/include some partitions on
> the fly per batch
> =
> I have following code that creates a direct stream using Kafka connector for
> Spark.
>
> final JavaInputDStream msgRecords =
> KafkaUtils.createDirectStream(
> jssc, String.class, String.class, StringDecoder.class,
> StringDecoder.class,
> KafkaMessage.class, kafkaParams, topicsPartitions,
> message -> {
> return KafkaMessage.builder()
> .
> .build();
> }
> );
>
> However I want to handle a situation, where I can decide that this streaming
> needs to pause for a while on conditional basis, is there any way to achieve
> this? Say my Kafka is undergoing some maintenance, so between 10AM to 12PM
> stop processing, and then again pick up at 12PM from the last offset, how do
> I do it?
>
> Also, assume all of a sudden we want to take one-or-more of the partitions
> for a pull and add it back after some pulls, how do I achieve that?
>
> -Regards
> Sagar
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-02 Thread sagarcasual .
Hello, this is for
Pausing spark kafka streaming (direct) or exclude/include some partitions
on the fly per batch
=
I have following code that creates a direct stream using Kafka connector
for Spark.

final JavaInputDStream<KafkaMessage> msgRecords = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        KafkaMessage.class, kafkaParams, topicsPartitions,
        message -> {
            return KafkaMessage.builder()
                    .
                    .build();
        }
);

However, I want to handle a situation where I can decide that this streaming
needs to pause for a while on a conditional basis. Is there any way to achieve
this? Say my Kafka cluster is undergoing maintenance, so between 10AM and 12PM
I want to stop processing, and then pick up again at 12PM from the last offset.
How do I do it?

Also, assume that all of a sudden we want to take one or more of the partitions
out of a pull and add them back after some pulls. How do I achieve that?

-Regards
Sagar


Does Spark support Partition Pruning with Parquet Files

2016-09-02 Thread Lost Overflow
Hello everyone, can anyone answer this: http://stackoverflow.com/q/37180073/6022341? Thanks in advance.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala Vs Python

2016-09-02 Thread darren
I politely disagree. The JVM is one VM; Python has another. It's less about
preference and more about where the industry's skills are going for data
analysis, BI, etc. No one cares about JVM vs. PVM; they do care about time. So if
the time to prototype is 10x faster (in calendar days) but the VM is slower in
CPU cycles, the greater benefit decides what's best. The industry trend is
clear now.
And seemingly Spark is moving in its own direction. In my opinion, of course.


Sent from my Verizon, Samsung Galaxy smartphone

-------- Original message --------
From: Sivakumaran S
Date: 9/2/16 4:03 AM (GMT-05:00)
To: Mich Talebzadeh
Cc: Jakob Odersky, ayan guha, Tal Grynbaum, darren, kant kodali, AssafMendelson, user
Subject: Re: Scala Vs Python
Whatever benefits you may accrue from the rapid prototyping and coding in 
Python, it will be offset against the time taken to convert it to run inside 
the JVM. This of course depends on the complexity of the DAG. I guess it is a 
matter of language preference. 
Regards,
Sivakumaran S
On 02-Sep-2016, at 8:58 AM, Mich Talebzadeh  wrote:
From an outsider point of view nobody likes change :)
However, it appears to me that Scala is a rising star and if one learns it, it 
is another iron in the fire so to speak. I believe as we progress in time Spark 
is going to move away from Python. If you look at 2014 Databricks code 
examples, they were mostly in Python. Now they are mostly in Scala for a reason.
HTH




Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction
of data or any other property which may arise from relying on this email's 
technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.  



On 2 September 2016 at 08:23, Jakob Odersky  wrote:
Forgot to answer your question about feature parity of Python w.r.t. Spark's 
different components
I mostly work with scala so I can't say for sure but I think that all pre-2.0 
features (that's basically everything except Structured Streaming) are on par. 
Structured Streaming is a pretty new feature and Python support is currently 
not available. The API is not final however and I reckon that Python support 
will arrive once it gets finalized, probably in the next version.







Dataset encoder for java.time.LocalDate?

2016-09-02 Thread Daniel Siegmann
It seems Spark can handle case classes with java.sql.Date, but not
java.time.LocalDate. It complains there's no encoder.

Are there any plans to add an encoder for LocalDate (and other classes in
the new Java 8 Time and Date API), or is there an existing library I can
use that provides encoders?

--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
No offence taken. Glad that it was rectified.

Cheers

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 16:03, Nicholas Chammas 
wrote:

> I apologize for my harsh tone. You are right, it was unnecessary and
> discourteous.
>
> On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> You made such statement:
>>
>> "That's complete nonsense."
>>
>> That is a strong language and void of any courtesy. Only dogmatic
>> individuals make such statements, engaging the keyboard before thinking
>> about it.
>>
>> You are perfectly in your right to agree to differ. However, that does
>> not give you the right to call other peoples opinion nonsense.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 September 2016 at 15:54, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> You made a specific claim -- that Spark will move away from Python --
>>> which I responded to with clear references and data. How on earth is that a
>>> "religious argument"?
>>>
>>> I'm not saying that Python is better than Scala or anything like that.
>>> I'm just addressing your specific claim about its future in the Spark
>>> project.
>>>
>>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Right so. We are back into religious arguments. Best of luck



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 2 September 2016 at 15:35, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> I believe as we progress in time Spark is going to move away from
>> Python. If you look at 2014 Databricks code examples, they were
>> mostly in Python. Now they are mostly in Scala for a reason.
>>
>
> That's complete nonsense.
>
> First off, you can find dozens and dozens of Python code examples
> here: https://github.com/apache/spark/tree/master/
> examples/src/main/python
>
> The Python API was added to Spark in 0.7.0
> , back in
> February of 2013, before Spark was even accepted into the Apache 
> incubator.
> Since then it's undergone major and continuous development. Though it does
> lag behind the Scala API in some areas, it's a first-class language and
> bringing it up to parity with Scala is an explicit project goal. A quick
> example off the top of my head is all the work that's going into model
> import/export for Python: SPARK-11939
> 
>
> Additionally, according to the 2015 Spark Survey
> ,
> 58% of Spark users use the Python API, more than any other language save
> for Scala (71%). (Users can select multiple languages on the survey.)
> Python users were also the 3rd-fastest growing "demographic" for Spark,
> after Windows and Spark Streaming users.
>
> Any notion that Spark is going to "move away from Python" is
> completely 

Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
I apologize for my harsh tone. You are right, it was unnecessary and
discourteous.

On Fri, Sep 2, 2016 at 11:01 AM Mich Talebzadeh 
wrote:

> Hi,
>
> You made such statement:
>
> "That's complete nonsense."
>
> That is a strong language and void of any courtesy. Only dogmatic
> individuals make such statements, engaging the keyboard before thinking
> about it.
>
> You are perfectly in your right to agree to differ. However, that does not
> give you the right to call other peoples opinion nonsense.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:54, Nicholas Chammas  > wrote:
>
>> You made a specific claim -- that Spark will move away from Python --
>> which I responded to with clear references and data. How on earth is that a
>> "religious argument"?
>>
>> I'm not saying that Python is better than Scala or anything like that.
>> I'm just addressing your specific claim about its future in the Spark
>> project.
>>
>> On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Right so. We are back into religious arguments. Best of luck
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 September 2016 at 15:35, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> I believe as we progress in time Spark is going to move away from
> Python. If you look at 2014 Databricks code examples, they were
> mostly in Python. Now they are mostly in Scala for a reason.
>

 That's complete nonsense.

 First off, you can find dozens and dozens of Python code examples here:
 https://github.com/apache/spark/tree/master/examples/src/main/python

 The Python API was added to Spark in 0.7.0
 , back in
 February of 2013, before Spark was even accepted into the Apache incubator.
 Since then it's undergone major and continuous development. Though it does
 lag behind the Scala API in some areas, it's a first-class language and
 bringing it up to parity with Scala is an explicit project goal. A quick
 example off the top of my head is all the work that's going into model
 import/export for Python: SPARK-11939
 

 Additionally, according to the 2015 Spark Survey
 ,
 58% of Spark users use the Python API, more than any other language save
 for Scala (71%). (Users can select multiple languages on the survey.)
 Python users were also the 3rd-fastest growing "demographic" for Spark,
 after Windows and Spark Streaming users.

 Any notion that Spark is going to "move away from Python" is completely
 contradicted by the facts.

 Nick


>>>
>


Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
You made a specific claim -- that Spark will move away from Python -- which
I responded to with clear references and data. How on earth is that a
"religious argument"?

I'm not saying that Python is better than Scala or anything like that. I'm
just addressing your specific claim about its future in the Spark project.

On Fri, Sep 2, 2016 at 10:48 AM Mich Talebzadeh 
wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> , back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> 
>>
>> Additionally, according to the 2015 Spark Survey
>> ,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
>


Re: Scala Vs Python

2016-09-02 Thread andy petrella
looking at the examples, indeed they make nonsense :D

On Fri, 2 Sep 2016 16:48 Mich Talebzadeh,  wrote:

> Right so. We are back into religious arguments. Best of luck
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 September 2016 at 15:35, Nicholas Chammas  > wrote:
>
>> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
>> wrote:
>>
>>> I believe as we progress in time Spark is going to move away from
>>> Python. If you look at 2014 Databricks code examples, they were mostly
>>> in Python. Now they are mostly in Scala for a reason.
>>>
>>
>> That's complete nonsense.
>>
>> First off, you can find dozens and dozens of Python code examples here:
>> https://github.com/apache/spark/tree/master/examples/src/main/python
>>
>> The Python API was added to Spark in 0.7.0
>> , back in
>> February of 2013, before Spark was even accepted into the Apache incubator.
>> Since then it's undergone major and continuous development. Though it does
>> lag behind the Scala API in some areas, it's a first-class language and
>> bringing it up to parity with Scala is an explicit project goal. A quick
>> example off the top of my head is all the work that's going into model
>> import/export for Python: SPARK-11939
>> 
>>
>> Additionally, according to the 2015 Spark Survey
>> ,
>> 58% of Spark users use the Python API, more than any other language save
>> for Scala (71%). (Users can select multiple languages on the survey.)
>> Python users were also the 3rd-fastest growing "demographic" for Spark,
>> after Windows and Spark Streaming users.
>>
>> Any notion that Spark is going to "move away from Python" is completely
>> contradicted by the facts.
>>
>> Nick
>>
>>
> --
andy


Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
Right so. We are back into religious arguments. Best of luck



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 September 2016 at 15:35, Nicholas Chammas 
wrote:

> On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
> wrote:
>
>> I believe as we progress in time Spark is going to move away from Python. If
>> you look at 2014 Databricks code examples, they were mostly in Python. Now
>> they are mostly in Scala for a reason.
>>
>
> That's complete nonsense.
>
> First off, you can find dozens and dozens of Python code examples here:
> https://github.com/apache/spark/tree/master/examples/src/main/python
>
> The Python API was added to Spark in 0.7.0
> , back in
> February of 2013, before Spark was even accepted into the Apache incubator.
> Since then it's undergone major and continuous development. Though it does
> lag behind the Scala API in some areas, it's a first-class language and
> bringing it up to parity with Scala is an explicit project goal. A quick
> example off the top of my head is all the work that's going into model
> import/export for Python: SPARK-11939
> 
>
> Additionally, according to the 2015 Spark Survey
> ,
> 58% of Spark users use the Python API, more than any other language save
> for Scala (71%). (Users can select multiple languages on the survey.)
> Python users were also the 3rd-fastest growing "demographic" for Spark,
> after Windows and Spark Streaming users.
>
> Any notion that Spark is going to "move away from Python" is completely
> contradicted by the facts.
>
> Nick
>
>


Re: BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Sean Owen
Given recall by threshold, you can compute true positive count per
threshold by just multiplying through by the count of elements where
label = 1. From that you can get false negatives by subtracting from
that same count.

Given precision by threshold, and true positive count by threshold,
you can work out the count of elements that were predicted as
positive, by threshold. That should then get you to false positive
count, and from there you can similarly work out true negatives.

So I think you can reverse engineer those stats from precision and
recall without much trouble. I don't think there's a more direct way,
no, unless you want to clone the code and/or hack into it with
reflection.
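
For example, a sketch of that derivation (assuming `metrics` is the
BinaryClassificationMetrics instance and `predictionAndLabels` is the underlying
RDD of (score, label) pairs):

val positives = predictionAndLabels.filter(_._2 == 1.0).count().toDouble
val total = predictionAndLabels.count().toDouble

val countsByThreshold =
  metrics.recallByThreshold().join(metrics.precisionByThreshold()).map {
    case (t, (recall, precision)) =>
      val tp = recall * positives                                // recall = tp / positives
      val predictedPos = if (precision > 0) tp / precision else 0.0
      val fp = predictedPos - tp                                 // precision = tp / (tp + fp)
      val fn = positives - tp
      val tn = total - tp - fp - fn
      (t, (tp, fp, tn, fn))
  }
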

On Fri, Sep 2, 2016 at 2:54 PM, Spencer, Alex (Santander)
 wrote:
> Hi,
>
>
>
> BinaryClassificationMetrics expose recall and precision byThreshold. Is
> there a way to true negatives / false negatives etc per threshold?
>
>
>
> I have weighted my genuines and would like the adjusted precision / FPR.
> (Unless there is an option that I’ve missed, although I have been over the
> Class twice now and can’t see any weighting options). I had to build my own,
> which seems a bit like reinventing the wheel (isn’t as safe + fast for a
> start):
>
>
>
> val threshold_stats =
> metrics.thresholds.cartesian(predictionAndLabels).map{case (t, (prob,
> label)) =>
>
>   val selected = (prob >= t)
>
>   val fraud = (label == 1.0)
>
>
>
>   val tp = if (fraud && selected) 1 else 0
>
>   val fp = if (!fraud && selected) 1 else 0
>
>   val tn = if (!fraud && !selected) 1 else 0
>
>   val fn = if (fraud && !selected) 1 else 0
>
>
>
>   (t, (tp, fp, tn, fn))
>
> }.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 +
> y._4))
>
>
>
> Kind Regards,
>
> Alex.
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Scala Vs Python

2016-09-02 Thread Nicholas Chammas
On Fri, Sep 2, 2016 at 3:58 AM Mich Talebzadeh 
wrote:

> I believe as we progress in time Spark is going to move away from Python. If
> you look at 2014 Databricks code examples, they were mostly in Python. Now
> they are mostly in Scala for a reason.
>

That's complete nonsense.

First off, you can find dozens and dozens of Python code examples here:
https://github.com/apache/spark/tree/master/examples/src/main/python

The Python API was added to Spark in 0.7.0
, back in February
of 2013, before Spark was even accepted into the Apache incubator. Since
then it's undergone major and continuous development. Though it does lag
behind the Scala API in some areas, it's a first-class language and
bringing it up to parity with Scala is an explicit project goal. A quick
example off the top of my head is all the work that's going into model
import/export for Python: SPARK-11939


Additionally, according to the 2015 Spark Survey
,
58% of Spark users use the Python API, more than any other language save
for Scala (71%). (Users can select multiple languages on the survey.)
Python users were also the 3rd-fastest growing "demographic" for Spark,
after Windows and Spark Streaming users.

Any notion that Spark is going to "move away from Python" is completely
contradicted by the facts.

Nick


BinaryClassificationMetrics - get raw tp/fp/tn/fn stats per threshold?

2016-09-02 Thread Spencer, Alex (Santander)
Hi,

BinaryClassificationMetrics exposes recall and precision byThreshold. Is there a
way to get true negatives / false negatives etc. per threshold?

I have weighted my genuines and would like the adjusted precision / FPR. 
(Unless there is an option that I've missed, although I have been over the 
Class twice now and can't see any weighting options). I had to build my own, 
which seems a bit like reinventing the wheel (isn't as safe + fast for a start):

val threshold_stats = 
metrics.thresholds.cartesian(predictionAndLabels).map{case (t, (prob, label)) =>
  val selected = (prob >= t)
  val fraud = (label == 1.0)

  val tp = if (fraud && selected) 1 else 0
  val fp = if (!fraud && selected) 1 else 0
  val tn = if (!fraud && !selected) 1 else 0
  val fn = if (fraud && !selected) 1 else 0

  (t, (tp, fp, tn, fn))
}.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3, x._4 + y._4))

Kind Regards,
Alex.
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Passing Custom App Id for consumption in History Server

2016-09-02 Thread Amit Shanker
Currently Spark sets the current time in milliseconds as the app Id. Is there a
way one can pass in the app id to the spark job, so that it uses this
provided app id instead of generating one using time?

Let's take the following scenario: I have a system application which
schedules spark jobs, and records the metadata for that job (say job
params, cores, etc). In this system application, I want to link every job
with its corresponding UI (history server). The only way I can do this is
if I have the app Id of that job stored in this system application. And the
only way one can get the app Id is by using the
SparkContext.getApplicationId() function, which needs to be run from
inside the job. So this makes it difficult to convey this piece of
information from Spark to a system outside Spark.
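
One hedged workaround sketch (the path layout and argument passing are
assumptions, not a Spark API): pass your own job key into the job as an
argument and have the driver publish the mapping from that key to
sc.applicationId somewhere the external scheduler can read it, e.g. on HDFS:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Record (scheduler job key -> Spark application id) so the external system
// can link each job to its History Server page.
def publishAppId(sc: SparkContext, schedulerJobKey: String): Unit = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val out = fs.create(new Path(s"/jobs/$schedulerJobKey/spark-app-id"), true)
  try out.write(sc.applicationId.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}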

Thanks,
Amit Shanker


Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Ian Stokes Rees

On 9/2/16 3:47 AM, Felix Cheung wrote:

There is an Anaconda parcel one could readily install on CDH

https://docs.continuum.io/anaconda/cloudera

As Sean says it is Python 2.7.x.

Spark should work for both 2.7 and 3.5.


Yes, I'm actually an engineer at Continuum, so I know the Anaconda 
parcel pretty well.  It is more a question of whether CDH and Spark 
"work" better with PySpark on Python 2.7 or Python 3.5.  My sense was 
"you choose: both are fine", but I wanted to ask here before committing 
to going down one path or another.


Thanks,

Ian


Re: Grouping on bucketed and sorted columns

2016-09-02 Thread Fridtjof Sander
I managed to do some experimental evaluation, and it seems I understood the
code correctly:
A partition that consists of Hive buckets is read bucket-file by
bucket-file, which leads to the loss of the internal sorting.


Does anyone have an opinion about my alternative idea of reading from 
multiple bucket-files simultaneously to keep that ordering?


Regarding the followup questions:

1. I found the `collect_list()` function, which seems to provide what I
want. However, I fail to collect more than one column. Is there a way to
do, basically, .agg(collect_list("*"))? (See the sketch below.)


2. I worked around that problem by writing and reading the table within
the same context/session, so that the ephemeral metastore doesn't lose
its content. However, in general a Hive metastore seems to be required
for production usage, since only an ephemeral and a Hive catalog
implementation are available in 2.0.0. (Also touched on in the sketch below.)
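
A minimal sketch covering both points: enabling the Hive catalog so that the
metadata written by saveAsTable survives across sessions, and collecting all
columns per group by wrapping them in a struct. It assumes Hive support is on
the classpath with a metastore configured, and whether collect_list accepts
struct-typed input may depend on the Spark version:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Persistent catalog instead of the in-memory (ephemeral) one.
val spark = SparkSession.builder()
  .appName("bucketing-experiments")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.table("table")   // the table written with saveAsTable("table")

// Emulates collect_list("*"): nest every column in a struct per group.
val grouped = df
  .groupBy(col("id"))
  .agg(collect_list(struct(df.columns.map(col): _*)).as("rows"))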


I would highly appreciate some feedback to my thoughts and questions

Am 31.08.2016 um 14:45 schrieb Fridtjof Sander:

Hi Spark users,

I'm currently investigating spark's bucketing and partitioning 
capabilities and I have some questions:


Let /T/ be a table that is bucketed and sorted by /T.id/ and 
partitioned by /T.date/. Before persisting, /T/ has been repartitioned 
by /T.id/ to get only one file per bucket.

I want to group by /T.id/ over a subset of /T.date/'s values.

It seems to me that the best execution plan in this scenario would be 
the following:
- Schedule one stage (no exchange) with as many tasks as we have 
bucket-ids, so that there is a mapping from each task to a bucket-id
- Each task opens all bucket-files belonging to "its" bucket-id 
simultaneously, which is one per affected partition /T.date/
- Since the data inside the buckets is sorted, we can perform the 
second phase of a two-phase multiway merge-sort to get our groups, 
which can be "pipelined" into the next operator


From what I understand after scanning through the code, however, it 
appears that each bucket-file is read completely before the 
record-iterator is advanced to the next bucket-file (see FileScanRDD; 
the same applies to Hive). So a groupBy would require sorting the 
partitions of the resulting RDD before the groups can be emitted, 
which results in a blocking operation.


Could anyone confirm that I'm assessing the situation correctly here, 
or correct me if not?


Followup questions:

1. Is there a way to get the "sql" groups into the RDD API, like the 
RDD groupBy would return them? I fail to formulate a job like this, 
because a query with a groupBy that lacks an aggregation function is 
invalid.
2. I haven't simply tested this, because I fail to load a table with 
the properties specified above:

After writing a table like this:
.write().partitionBy("date").bucketBy(4,"id").sortBy("id").format("json").saveAsTable("table");
I fail to read it again, with the partitioning and bucketing being 
recognized.
Is a functioning Hive-Metastore required for this to work, or is there 
a workaround?


I hope someone can spare the time to help me out here.

All the best,
Fridtjof





Re[2]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-02 Thread Сергей Романов

Hi, Mich,

Column x29 does not seem to be special in any way. It's a newly created table and I 
did not calculate stats for any columns. Actually, I can sum a single column 
several times in a query and hit a landslide performance drop at some "magic" 
point. Setting "set spark.sql.codegen.wholeStage=false" makes all requests run 
in a similarly slow time (which is slower than the original time).
PS. Do stats even help for Spark queries?
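
A quick way to compare the two regimes from a SparkSession (a sketch; `spark`
is assumed to be an active session). Operators compiled by whole-stage codegen
show up with a leading '*' in explain() output, as in the *HashAggregate below:

// Sketch: inspect the physical plan with whole-stage codegen on and off.
val q = """SELECT `advertiser_id`, SUM(`dd_convs`)
           FROM `slicer`.`573_slicer_rnd_13`
           WHERE dt = '2016-07-28'
           GROUP BY `advertiser_id`"""

spark.conf.set("spark.sql.codegen.wholeStage", "true")
spark.sql(q).explain()   // operators prefixed with '*' are code-generated

spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql(q).explain()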
SELECT field, SUM(x28) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`,  SUM(`dd_convs`) AS `dd_convs`  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30;
30 rows selected (1.37 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.399 seconds)

SELECT field, SUM(x29) FROM parquet_table WHERE partition = 1 GROUP BY field
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`,  SUM(`actual_dsp_fee`) AS `actual_dsp_fee`  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30; 
30 rows selected (1.379 seconds)
30 rows selected (1.382 seconds)
30 rows selected (1.377 seconds)
SELECT field, SUM(x28) x repeat 40 times  FROM parquet_table WHERE partition = 
1 GROUP BY field -> 1.774s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY `advertiser_id` 
 LIMIT 30;
30 rows selected (1.774 seconds)
30 rows selected (1.721 seconds)
30 rows selected (2.813 seconds)
SELECT field, SUM(x28) x repeat 41 times FROM parquet_table WHERE partition = 1 
GROUP BY field -> 7.314s
0: jdbc:hive2://spark-master1.uslicer> SELECT `advertiser_id` AS 
`advertiser_id`, SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`)  
FROM `slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28'  GROUP BY 
`advertiser_id`  LIMIT 30;
30 rows selected (7.314 seconds)
30 rows selected (7.27 seconds)
30 rows selected (7.279 seconds)
SELECT SUM(x28) x repeat 57 times FROM parquet_table WHERE partition = 1 -> 
1.378s
0: jdbc:hive2://spark-master1.uslicer> EXPLAIN SELECT SUM(`dd_convs`) AS 
`dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`),SUM(`dd_convs`) AS `dd_convs`, 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`) AS `dd_convs`, SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) AS `dd_convs`, 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`), 
SUM(`dd_convs`), SUM(`dd_convs`), SUM(`dd_convs`) FROM 
`slicer`.`573_slicer_rnd_13`  WHERE dt = '2016-07-28';
plan  == Physical Plan ==
*HashAggregate(keys=[], functions=[sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L), sum(dd_convs#159025L), 
sum(dd_convs#159025L), sum(dd_convs#159025L), 

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-02 Thread Aseem Bansal
Hi

Thanks for all the details. I was able to convert from ml.NaiveBayesModel
to mllib.NaiveBayesModel and get it done. It is fast for our use case.
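
For reference, a sketch of what single-instance scoring looks like on the
.mllib side; `model` is assumed to be the converted
mllib.classification.NaiveBayesModel and the feature values are made up:

import org.apache.spark.mllib.linalg.Vectors

// Scoring one instance stays on the driver, so no Spark job is launched.
val features = Vectors.dense(0.2, 1.0, 0.0, 3.5)
val predictedLabel = model.predict(features)           // a single Double
val posterior = model.predictProbabilities(features)   // per-class probabilities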

Just one question. Before mllib is removed can ml package be expected to
reach feature parity with mllib?

On Thu, Sep 1, 2016 at 7:12 PM, Sean Owen  wrote:

> Yeah there's a method to predict one Vector in the .mllib API but not
> the newer one. You could possibly hack your way into calling it
> anyway, or just clone the logic.
>
> On Thu, Sep 1, 2016 at 2:37 PM, Nick Pentreath 
> wrote:
> > Right now you are correct that Spark ML APIs do not support predicting
> on a
> > single instance (whether Vector for the models or a Row for a pipeline).
> >
> > See https://issues.apache.org/jira/browse/SPARK-10413 and
> > https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some
> > discussion.
> >
> > There may be movement in the short term to support the single Vector
> case.
> > But anything for pipelines is not immediately on the horizon I'd say.
> >
> > N
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark scheduling mode

2016-09-02 Thread enrico d'urso
Thank you.

May I know when that comparator is called?
It looks like the Spark scheduler doesn't have any form of preemption, am I right?

Thank you


From: Mark Hamstra 
Sent: Thursday, September 1, 2016 8:44:10 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

Spark's FairSchedulingAlgorithm is not round robin: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulingAlgorithm.scala#L43

When at the scope of fair scheduling Jobs within a single Pool, the Schedulable 
entities being handled (s1 and s2) are TaskSetManagers, which are at the 
granularity of Stages, not Jobs.  Since weight is 1 and minShare is 0 for 
TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers just boils 
down to prioritizing TaskSets (i.e. Stages) with the fewest number of 
runningTasks.

On Thu, Sep 1, 2016 at 11:23 AM, enrico d'urso 
> wrote:

I tried it before, but still I am not able to see a proper round robin across 
the jobs I submit.
Given this:


<pool name="production">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
</pool>

Each job inside the production pool should be scheduled in a round-robin way, am I 
right?


From: Mark Hamstra >
Sent: Thursday, September 1, 2016 8:19:44 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

The default pool can be configured like any other 
pool: 
https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties

On Thu, Sep 1, 2016 at 11:11 AM, enrico d'urso 
> wrote:

Is there a way to force scheduling to be fair inside the default pool?
I mean, round robin for the jobs that belong to the default pool.

Cheers,


From: Mark Hamstra >
Sent: Thursday, September 1, 2016 7:24:54 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean 
that Spark can magically configure and start multiple scheduling pools for you, 
nor can it know to which pools you want jobs assigned.  Without doing any setup 
of additional scheduling pools or assigning of jobs to pools, you're just 
dumping all of your jobs into the one available default pool (which is now 
being fair scheduled with an empty set of other pools) and the scheduling of 
jobs within that pool is still the default intra-pool scheduling, FIFO -- i.e., 
you've effectively accomplished nothing by only flipping spark.scheduler.mode 
to FAIR.
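
A minimal sketch of that setup (paths and pool names are assumed; from PySpark
the same spark.scheduler.pool local property applies via sc.setLocalProperty):

import org.apache.spark.{SparkConf, SparkContext}

// Enable FAIR mode, point Spark at a pool definition file, and assign the
// jobs submitted from a given thread to one of the defined pools.
val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// In each submitting thread:
sc.setLocalProperty("spark.scheduler.pool", "production")
// ... actions triggered here are scheduled in the "production" pool ...
sc.setLocalProperty("spark.scheduler.pool", null)   // detach this thread again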

On Thu, Sep 1, 2016 at 7:10 AM, enrico d'urso 
> wrote:

I am building a Spark App, in which I submit several jobs (pyspark). I am using 
threads to run them in parallel, and also I am setting: 
conf.set("spark.scheduler.mode", "FAIR") Still, I see the jobs run serially in 
FIFO way. Am I missing something?

Cheers,


Enrico





Re: Fwd: Need some help

2016-09-02 Thread Aakash Basu
Hi Shashank/All,

Yes, you got it right, that's what I need to do. Can I get some help with
this? I have no clue what it is or how to work on it.

Thanks,
Aakash.
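
For what it's worth, a bare-bones sketch of Lloyd's algorithm on plain RDDs,
without MLlib. The input format, k and the iteration count are assumptions;
adapt the parsing to the actual file:

import org.apache.spark.{SparkConf, SparkContext}

// K-means (Lloyd's algorithm) on RDDs only, no MLlib.
// Assumes each line looks like "<objectId>,<f1>,<f2>,..."
object SimpleKMeans {

  def squaredDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def closestCenter(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => squaredDist(p, centers(i)))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleKMeans"))
    val k = 3              // number of clusters (assumption)
    val iterations = 20    // fixed count instead of a convergence test

    val points = sc.textFile("objects_features.txt")
      .map(_.split(",").drop(1).map(_.toDouble))   // drop the object id
      .cache()

    var centers = points.takeSample(withReplacement = false, k)

    for (_ <- 1 to iterations) {
      // Assign each point to its closest center, then average per cluster.
      val updated = points
        .map(p => (closestCenter(p, centers), (p, 1L)))
        .reduceByKey { case ((s1, n1), (s2, n2)) =>
          (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
        }
        .mapValues { case (sum, n) => sum.map(_ / n) }
        .collectAsMap()
      centers = centers.indices.map(i => updated.getOrElse(i, centers(i))).toArray
    }

    centers.zipWithIndex.foreach { case (c, i) =>
      println(s"center $i: ${c.mkString(",")}")
    }
    sc.stop()
  }
}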

On Fri, Sep 2, 2016 at 1:48 AM, Shashank Mandil 
wrote:

> Hi Aakash,
>
> I think what it generally means that you have to use the general spark
> APIs of Dataframe to bring in the data and crunch the numbers, however you
> cannot use the KMeansClustering algorithm which is already present in the
> MLlib spark library.
>
> I think a good place to start would be understanding what the KMeans
> clustering algorithm is and then looking into how you can use the DataFrame
> API to implement the KMeansClustering.
>
> Thanks,
> Shashank
>
> On Thu, Sep 1, 2016 at 1:05 PM, Aakash Basu 
> wrote:
>
>> Hey Siva,
>>
>> It needs to be done with Spark, without the use of any Spark libraries.
>> Need some help in this.
>>
>> Thanks,
>> Aakash.
>>
>> On Fri, Sep 2, 2016 at 1:25 AM, Sivakumaran S 
>> wrote:
>>
>>> If you are to do it without Spark, you are asking at the wrong place.
>>> Try Python + scikit-learn. Or R. If you want to do it with a UI based
>>> software, try Weka or Orange.
>>>
>>> Regards,
>>>
>>> Sivakumaran S
>>>
>>> On 1 Sep 2016 8:42 p.m., Aakash Basu  wrote:
>>>
>>>
>>> -- Forwarded message --
>>> From: *Aakash Basu* 
>>> Date: Thu, Aug 25, 2016 at 10:06 PM
>>> Subject: Need some help
>>> To: user@spark.apache.org
>>>
>>>
>>> Hi all,
>>>
>>> Aakash here, need a little help in KMeans clustering.
>>>
>>> This is needed to be done:
>>>
>>> "Implement Kmeans Clustering Algorithm without using the libraries of
>>> Spark. You're given a txt file with object ids and features from which you
>>> have to use the features as your data points. This will be a part of the
>>> code itself"
>>>
>>> PFA the file with ObjectIDs and features. Now how to go ahead and work
>>> on it?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>>
>>>
>>
>


Re: Scala Vs Python

2016-09-02 Thread Sivakumaran S
Whatever benefits you may accrue from the rapid prototyping and coding in 
Python, it will be offset against the time taken to convert it to run inside 
the JVM. This of course depends on the complexity of the DAG. I guess it is a 
matter of language preference. 

Regards,

Sivakumaran S
> On 02-Sep-2016, at 8:58 AM, Mich Talebzadeh  wrote:
> 
> From an outsider point of view nobody likes change :)
> 
> However, it appears to me that Scala is a rising star and if one learns it, 
> it is another iron in the fire so to speak. I believe as we progress in time 
> Spark is going to move away from Python. If you look at 2014 Databricks code 
> examples, they were mostly in Python. Now they are mostly in Scala for a 
> reason.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 September 2016 at 08:23, Jakob Odersky  > wrote:
> Forgot to answer your question about feature parity of Python w.r.t. Spark's 
> different components
> I mostly work with scala so I can't say for sure but I think that all pre-2.0 
> features (that's basically everything except Structured Streaming) are on 
> par. Structured Streaming is a pretty new feature and Python support is 
> currently not available. The API is not final however and I reckon that 
> Python support will arrive once it gets finalized, probably in the next 
> version.
> 
> 



Re: Scala Vs Python

2016-09-02 Thread Mich Talebzadeh
From an outsider point of view nobody likes change :)

However, it appears to me that Scala is a rising star and if one learns it,
it is another iron in the fire so to speak. I believe as we progress in
time Spark is going to move away from Python. If you look at 2014
Databricks code examples, they were mostly in Python. Now they are mostly
in Scala for a reason.

HTH






On 2 September 2016 at 08:23, Jakob Odersky  wrote:

> Forgot to answer your question about feature parity of Python w.r.t.
> Spark's different components
> I mostly work with scala so I can't say for sure but I think that all
> pre-2.0 features (that's basically everything except Structured Streaming)
> are on par. Structured Streaming is a pretty new feature and Python support
> is currently not available. The API is not final however and I reckon that
> Python support will arrive once it gets finalized, probably in the next
> version.
>
>


Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Felix Cheung
There is an Anaconda parcel one could readily install on CDH

https://docs.continuum.io/anaconda/cloudera

As Sean says it is Python 2.7.x.

Spark should work for both 2.7 and 3.5.

_
From: Sean Owen >
Sent: Friday, September 2, 2016 12:41 AM
Subject: Re: PySpark: preference for Python 2.7 or Python 3.5?
To: Ian Stokes Rees >
Cc: user @spark >


Spark should work fine with Python 3. I'm not a Python person, but all else 
equal I'd use 3.5 too. I assume the issue could be libraries you want that 
don't support Python 3. I don't think that changes with CDH. It includes a 
version of Anaconda from Continuum, but that lays down Python 2.7.11. I don't 
believe there's any particular position on 2 vs 3.

On Fri, Sep 2, 2016 at 3:56 AM, Ian Stokes Rees 
> wrote:
I have the option of running PySpark with Python 2.7 or Python 3.5.  I am 
fairly expert with Python and know the Python-side history of the differences.  
All else being the same, I have a preference for Python 3.5.  I'm using CDH 5.8 
and I'm wondering if that biases whether I should proceed with PySpark on top 
of Python 2.7 or 3.5.  Opinions?  Does Cloudera have an official (or 
unofficial) position on this?

Thanks,

Ian
___
Ian Stokes-Rees
Computational Scientist

Continuum Analytics | @ijstokes | 617.942.0218





Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Sean Owen
Spark should work fine with Python 3. I'm not a Python person, but all else
equal I'd use 3.5 too. I assume the issue could be libraries you want that
don't support Python 3. I don't think that changes with CDH. It includes a
version of Anaconda from Continuum, but that lays down Python 2.7.11. I
don't believe there's any particular position on 2 vs 3.

On Fri, Sep 2, 2016 at 3:56 AM, Ian Stokes Rees 
wrote:

> I have the option of running PySpark with Python 2.7 or Python 3.5.  I am
> fairly expert with Python and know the Python-side history of the
> differences.  All else being the same, I have a preference for Python 3.5.
> I'm using CDH 5.8 and I'm wondering if that biases whether I should proceed
> with PySpark on top of Python 2.7 or 3.5.  Opinions?  Does Cloudera have an
> official (or unofficial) position on this?
>
> Thanks,
>
> Ian
> ___
> Ian Stokes-Rees
> Computational Scientist
>
> Continuum Analytics | @ijstokes | 617.942.0218
>


Re: [Error:]while read s3 buckets in Spark 1.6 in spark -submit

2016-09-02 Thread Divya Gehlot
Hi Steve,
I am trying to read it from s3n://"bucket" and have already included aws-java-sdk
1.7.4 in my classpath.
My machine is AWS EMR with Hadoop 2.7.2 and Spark 1.6.1 installed.
As per the post below, it shows that issue with EMR Hadoop 2.7.2:
http://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
Is it really the issue?
Could somebody help me validate the above ?


Thanks,
Divya
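
For an sbt build, a hedged equivalent of the dependency pinning Steve suggests
below; the key point is that Hadoop 2.7.x's s3a was built against the 1.7.4
AWS SDK, not the newer, binary-incompatible 1.11.x line:

// build.sbt sketch: match the AWS SDK version that Hadoop 2.7.x expects.
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.4"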



On 1 September 2016 at 16:59, Steve Loughran  wrote:

>
> On 1 Sep 2016, at 03:45, Divya Gehlot  wrote:
>
> Hi,
> I am using Spark 1.6.1 in EMR machine
> I am trying to read s3 buckets in my Spark job .
> When I read it through Spark shell I am able to read it ,but when I try to
> package the job and and run it as spark submit I am getting below error
>
> 16/08/31 07:36:38 INFO ApplicationMaster: Registered signal handlers for
> [TERM, HUP, INT]
>
>> 16/08/31 07:36:39 INFO ApplicationMaster: ApplicationAttemptId:
>> appattempt_1468570153734_2851_01
>> Exception in thread "main" java.util.ServiceConfigurationError:
>> org.apache.hadoop.fs.FileSystem: Provider 
>> org.apache.hadoop.fs.s3a.S3AFileSystem
>> could not be instantiated
>> at java.util.ServiceLoader.fail(ServiceLoader.java:224)
>> at java.util.ServiceLoader.access$100(ServiceLoader.java:181)
>> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:377)
>>
> I have already included
>
>  "com.amazonaws" % "aws-java-sdk-s3" % "1.11.15",
>
> in my build.sbt
>
>
> Assuming you are using a released version of Hadoop 2.6 or 2.7 underneath
> spark, you will need to make sure your classpath has aws-java-sdk 1.7.4 on
> your CP. You can't just drop in a new JAR as it is incompatible at the API
> level ( https://issues.apache.org/jira/browse/HADOOP-12269 )
>
>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk</artifactId>
>   <version>1.7.4</version>
>   <scope>compile</scope>
> </dependency>
>
>
> and jackson artifacts databind and annotations in sync with the rest of
> your app
>
>
> <dependency>
>   <groupId>com.fasterxml.jackson.core</groupId>
>   <artifactId>jackson-databind</artifactId>
> </dependency>
> <dependency>
>   <groupId>com.fasterxml.jackson.core</groupId>
>   <artifactId>jackson-annotations</artifactId>
> </dependency>
>
>
> I tried providing the access key in my job as well; still the same error
> persists.
>
> When I googled it, I found that if you have an IAM role created there is no
> need to provide an access key.
>
>
>
> You don't get IAM support until Hadoop 2.8 ships. sorry. Needed a fair
> amount of reworking of how S3A does authentication.
>
> Note that if you launch spark jobs with the AWS environment variables set,
> these will be automatically picked up and used to set the relevant
> properties in the configuration.
>


Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
Forgot to answer your question about feature parity of Python w.r.t.
Spark's different components
I mostly work with scala so I can't say for sure but I think that all
pre-2.0 features (that's basically everything except Structured Streaming)
are on par. Structured Streaming is a pretty new feature and Python support
is currently not available. The API is not final however and I reckon that
Python support will arrive once it gets finalized, probably in the next
version.


Re: Scala Vs Python

2016-09-02 Thread Jakob Odersky
As you point out, often the reason that Python support lags behind is that
functionality is implemented in Scala, so the API in that language is
"free" whereas Python support needs to be added explicitly. Nevertheless,
Python bindings are an important part of Spark and is used by many people
(this info could be outdated but Python used to be the second most popular
language after Scala). I expect Python support to only get better in the
future so I think it is fair to say that Python is a first-class citizen in
Spark.

Regarding performance, the issue is more complicated. This is mostly due to
the fact that the actual execution of actions happens in JVM-land and any
correspondance between Python and the JVM is expensive. So the question
basically boils down to "how often does python need to communicate with the
JVM"? The answer depends on the Spark APIs you're using:

1. Plain old RDDs: for every function you pass to a transformation (filter,
map, etc.) an intermediate result will be shipped to a Python interpreter,
the function applied, and finally the result shipped back to the JVM.
2. DataFrames with RDD-like transformations or User Defined Functions: same
as point 1, any functions are applied in a Python environment and hence
data needs to be transferred.
3. DataFrames with only SQL expressions: Spark query optimizer will take
care of computing and executing an internal representation of your
transformations and no data communication needs to happen between Python
and the JVM (apart from final results in case you asked for them, i.e. by
calling a collect()).

In cases 1 and 2, there will be a lack in performance compared to
equivalent Scala or Java versions. The difference in case 3 is negligible,
as all language APIs share the same backend. See this blog post from
Databricks for some more detailed information:
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
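
A tiny illustration of the difference between cases 1/2 and case 3, shown in
Scala with toy data (in PySpark the UDF path would additionally ship rows to a
Python worker and back):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("udf-vs-expr").master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq((10.0, 2.0), (3.5, 4.0)).toDF("price", "qty")   // toy data

// Case 3: a built-in expression, planned and executed entirely in the JVM.
val withExpr = df.withColumn("total", col("price") * col("qty"))

// Cases 1/2: a UDF is opaque to the optimizer, and in PySpark each row would
// also be serialized out to a Python worker and back.
val times = udf((price: Double, qty: Double) => price * qty)
val withUdf = df.withColumn("total", times(col("price"), col("qty")))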

I hope this was the kind of information you were looking for. Please note
however that performance in Spark is a complex topic, the scenarios I
mentioned above should nevertheless give you some rule of thumb.

best,
--Jakob

On Thu, Sep 1, 2016 at 11:25 PM, ayan guha  wrote:

> Tal: I think by nature of the project itself, Python APIs are developed
> after Scala and Java, and it is a fair trade off between speed of getting
> stuff to market. And more and more this discussion is progressing, I see
> not much issue in terms of feature parity.
>
> Coming back to performance, Darren raised a good point: if I can scale
> out, individual VM performance should not matter much. But performance is
> often stated as a definitive downside of using Python over scala/java. I am
> trying to understand the truth and myth behind this claim. Any pointer
> would be great.
>
> best
> Ayan
>
> On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
> wrote:
>
>>
>> On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:
>>
>>> This topic is a concern for us as well. In the data science world no one
>>> uses native scala or java by choice. It's R and Python. And python is
>>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>>
>>> This is why we have decoupled from spark in our project. It's really
>>> unfortunate spark team have invested so heavily in scale.
>>>
>>> As for speed it comes from horizontal scaling and throughout. When you
>>> can scale outward, individual VM performance is less an issue. Basic HPC
>>> principles.
>>>
>>
>> Darren,
>>
>> My guess is that data scientist who will decouple themselves from spark,
>> will eventually left with more or less nothing. (single process
>> capabilities, or purely performing HPC's) (unless, unlikely, some good
>> spark competitor will emerge.  unlikely, simply because there is no need
>> for such).
>> But putting guessing aside - the reason python is 3rd in line for feature
>> support, is not because the spark developers were busy with scala, it's
>> because the features that are missing are those that support strong typing.
>> which is not relevant to python.  in other words, even if spark was
>> rewritten in python, and was to focus on python only, you would still not
>> get those features.
>>
>>
>>
>> --
>> *Tal Grynbaum* / *CTO & co-founder*
>>
>> m# +972-54-7875797
>>
>> mobile retention done right
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


RE: Scala Vs Python

2016-09-02 Thread Santoshakhilesh
I have seen a talk by Brian Clapper in NE-SCALA 2016 - RDDs, DataFrames and 
Datasets @ Apache Spark - NE Scala 2016

At 15:00 there is a slide showing a comparison of aggregating 10 million 
integer pairs using RDDs and DataFrames with different language bindings like 
Scala, Python, and R.

As per this slide, the DataFrame API outperforms RDDs and all the language bindings 
perform about the same, while the RDD API with Python is way slower than the Scala 
version. So I guess there is some reality to the Scala bindings being faster in some cases.

At 30:23 he presents a slide on serialization performance, where Dataset 
encoders are way faster than Java serialization and Kryo.

But as always, the proof of the pudding is in the eating, so why don't you try some 
samples and see for yourself?
I personally have found that my app runs a bit faster with the Scala version than 
the Java one, but I am not yet able to figure out the reason.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 02 September 2016 15:25
To: Tal Grynbaum
Cc: darren; Mich Talebzadeh; Jakob Odersky; kant kodali; AssafMendelson; user
Subject: Re: Scala Vs Python

Tal: I think by nature of the project itself, Python APIs are developed after 
Scala and Java, and it is a fair trade off between speed of getting stuff to 
market. And more and more this discussion is progressing, I see not much issue 
in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out, 
individual VM performance should not matter much. But performance is often 
stated as a definitive downside of using Python over scala/java. I am trying to 
understand the truth and myth behind this claim. Any pointer would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum 
> wrote:

On Fri, Sep 2, 2016 at 1:15 AM, darren 
> wrote:
This topic is a concern for us as well. In the data science world no one uses 
native scala or java by choice. It's R and Python. And python is growing. Yet 
in spark, python is 3rd in line for feature support, if at all.

This is why we have decoupled from spark in our project. It's really 
unfortunate spark team have invested so heavily in scale.

As for speed it comes from horizontal scaling and throughout. When you can 
scale outward, individual VM performance is less an issue. Basic HPC principles.

Darren,

My guess is that data scientist who will decouple themselves from spark, will 
eventually left with more or less nothing. (single process capabilities, or 
purely performing HPC's) (unless, unlikely, some good spark competitor will 
emerge.  unlikely, simply because there is no need for such).
But putting guessing aside - the reason python is 3rd in line for feature 
support, is not because the spark developers were busy with scala, it's because 
the features that are missing are those that support strong typing. which is 
not relevant to python.  in other words, even if spark was rewritten in python, 
and was to focus on python only, you would still not get those features.



--
Tal Grynbaum / CTO & co-founder

m# +972-54-7875797
mobile retention done right



--
Best Regards,
Ayan Guha


Re: Scala Vs Python

2016-09-02 Thread ayan guha
Tal: I think by nature of the project itself, Python APIs are developed
after Scala and Java, and it is a fair trade off between speed of getting
stuff to market. And more and more this discussion is progressing, I see
not much issue in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out,
individual VM performance should not matter much. But performance is often
stated as a definitive downside of using Python over scala/java. I am
trying to understand the truth and myth behind this claim. Any pointer
would be great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum  wrote:

>
> On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:
>
>> This topic is a concern for us as well. In the data science world no one
>> uses native scala or java by choice. It's R and Python. And python is
>> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>>
>> This is why we have decoupled from spark in our project. It's really
>> unfortunate spark team have invested so heavily in scale.
>>
>> As for speed it comes from horizontal scaling and throughout. When you
>> can scale outward, individual VM performance is less an issue. Basic HPC
>> principles.
>>
>
> Darren,
>
> My guess is that data scientist who will decouple themselves from spark,
> will eventually left with more or less nothing. (single process
> capabilities, or purely performing HPC's) (unless, unlikely, some good
> spark competitor will emerge.  unlikely, simply because there is no need
> for such).
> But putting guessing aside - the reason python is 3rd in line for feature
> support, is not because the spark developers were busy with scala, it's
> because the features that are missing are those that support strong typing.
> which is not relevant to python.  in other words, even if spark was
> rewritten in python, and was to focus on python only, you would still not
> get those features.
>
>
>
> --
> *Tal Grynbaum* / *CTO & co-founder*
>
> m# +972-54-7875797
>
> mobile retention done right
>



-- 
Best Regards,
Ayan Guha


Re: Custom return code

2016-09-02 Thread Pierre Villard
Any hint?

2016-08-31 20:40 GMT+02:00 Pierre Villard :

> Hi,
>
> I am using Spark 1.5.2 and I am submitting a job (jar file) using
> spark-submit command in a yarn cluster mode. I'd like the command to return
> a custom return code.
>
> In the run method, if I do:
> sys.exit(myCode)
> the command will always return 0.
>
> The only way to have something not equal to 0 is to throw an exception and
> this will return 1.
>
> Is there a way to have a custom return code from the job application?
>
> Thanks a lot!
>
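
If it helps, one hedged sketch: in yarn-client mode the driver runs inside the
spark-submit JVM, so exiting the driver with a chosen code is reflected in the
spark-submit return code (in cluster mode it is not). Everything named here is
hypothetical:

// Sketch only (client mode): map application outcomes to custom exit codes.
class DataValidationException(msg: String) extends RuntimeException(msg)

def runApplication(): Unit = {
  // ... the actual job; throws DataValidationException on bad input ...
}

try {
  runApplication()
  sys.exit(0)
} catch {
  case _: DataValidationException => sys.exit(3)   // custom failure code
  case _: Throwable               => sys.exit(1)   // anything else
}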


Re: Scala Vs Python

2016-09-02 Thread Tal Grynbaum
On Fri, Sep 2, 2016 at 1:15 AM, darren  wrote:

> This topic is a concern for us as well. In the data science world no one
> uses native scala or java by choice. It's R and Python. And python is
> growing. Yet in spark, python is 3rd in line for feature support, if at all.
>
> This is why we have decoupled from spark in our project. It's really
> unfortunate spark team have invested so heavily in scale.
>
> As for speed it comes from horizontal scaling and throughout. When you can
> scale outward, individual VM performance is less an issue. Basic HPC
> principles.
>

Darren,

My guess is that data scientists who decouple themselves from Spark
will eventually be left with more or less nothing (single-process
capabilities, or purely performing HPC), unless some good Spark
competitor emerges, which is unlikely simply because there is no need
for one.
But putting guessing aside: the reason Python is 3rd in line for feature
support is not that the Spark developers were busy with Scala; it's
that the missing features are those that support strong typing, which
is not relevant to Python. In other words, even if Spark were rewritten
in Python and focused on Python only, you would still not get those
features.



-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right