Re: Running Hive and Spark together with Dynamic Resource Allocation

2016-10-28 Thread Stéphane Verlet
This works for us:

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>





On Thu, Oct 27, 2016 at 8:13 AM, rachmaninovquartet <
rachmaninovquar...@gmail.com> wrote:

> oblem is that for Spark 1.5.2 with dynamic resource allocation to
> function properly we needed to set y
>
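
For completeness, the Spark-side configuration that usually accompanies the
external shuffle service registered above (shown here only as a hedged sketch,
not part of the original reply; the app name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: dynamic allocation needs the external shuffle service
// (the spark_shuffle aux-service above) so executors can be released
// without losing their shuffle files.
val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")        // illustrative name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)
```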


java.lang.OutOfMemoryError: unable to create new native thread

2016-10-28 Thread kant kodali
 "dag-scheduler-event-loop" java.lang.OutOfMemoryError: unable to create
new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(
ForkJoinPool.java:1672)
at scala.concurrent.forkjoin.ForkJoinPool.signalWork(
ForkJoinPool.java:1966)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.push(
ForkJoinPool.java:1072)
at scala.concurrent.forkjoin.ForkJoinTask.fork(
ForkJoinTask.java:654)
at scala.collection.parallel.ForkJoinTasks$WrappedTask$

This is the error produced by the Spark driver program, which runs in client
mode by default. Some people suggest just increasing the heap size by passing
the --driver-memory 3g flag, but the message "unable to create new native
thread" really means that the JVM asked the OS to create a new thread and the
OS could no longer allocate one. The number of threads a JVM can create by
requesting the OS is platform dependent, but it is typically around 32K threads
on a 64-bit JVM. So I am wondering why Spark is creating so many threads, and
how do I control this number?
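
As a side note for anyone debugging this, a quick way to see how many threads
the driver JVM is actually holding (a diagnostic sketch using the standard JMX
thread bean, not something from the original post):

```scala
import java.lang.management.ManagementFactory

// Run inside the driver (e.g. from the spark-shell) and compare the numbers
// against the OS per-user thread limit (`ulimit -u`); hitting that limit,
// not the heap, is what produces "unable to create new native thread".
val threads = ManagementFactory.getThreadMXBean
println(s"live threads = ${threads.getThreadCount}, peak = ${threads.getPeakThreadCount}")
```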


Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Marcelo Vanzin
On Fri, Oct 28, 2016 at 11:14 AM, Elkhan Dadashov  wrote:
> But if the map task will finish before the Spark job finishes, that means
> SparkLauncher will go away. if the SparkLauncher handle goes away, then I
> lose the ability to track the app's state, right ?
>
> I'm investigating if there is a way to know Spark job completion (without
> Spark Job History Server) in asynchronous manner.

Correct. As I said in my other reply to you, if you can't use Spark's
API for whatever reason, you have to talk directly to the cluster
managers, and at that point it's out of Spark's hands to help you.

-- 
Marcelo




Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Elkhan Dadashov
Hi Marcelo,

Thanks for the reply.

But that means the SparkAppHandle needs to stay alive until the Spark job
completes.

In my case I launch the Spark job from the delegator map task in the cluster.
That means the map task container needs to stay alive and wait until the Spark
job completes.

But if the map task finishes before the Spark job finishes, the SparkLauncher
will go away. If the SparkLauncher handle goes away, then I lose the ability to
track the app's state, right?

I'm investigating whether there is a way to know about Spark job completion
(without the Spark Job History Server) in an asynchronous manner.

Like getting a callback on MapReduce job completion - a notification instead of
polling.

*[Another option]*
According to the Spark docs, Spark metrics can be configured with different
sinks.

I wonder whether it is possible to determine the job completion status from
these metrics.

Do Spark metrics provide job state information for each job id too ?
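
As an aside (not from the original thread): inside the driver itself, per-job
completion events are also available through a SparkListener. A minimal sketch,
assuming a live SparkContext `sc`:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// onJobEnd fires once per job with its id and result (success or failure),
// so no polling is needed on the driver side.
sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    println(s"job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
  }
})
```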

On Fri, Oct 28, 2016 at 11:05 AM Marcelo Vanzin  wrote:

If you look at the "startApplication" method it takes listeners as
parameters.

On Fri, Oct 28, 2016 at 10:23 AM, Elkhan Dadashov 
wrote:
> Hi,
>
> I know that we can use SparkAppHandle (introduced in SparkLauncher version
>>=1.6), and let the delegator map task stay alive until the Spark job
> finishes. But i wonder, if this can be done via callback notification
> instead of polling.
>
> Can i get callback notification on Spark job completion ?
>
> Similar to Hadoop, get a callback on MapReduce job completion - getting a
> notification instead of polling.
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value. Can be retrieved from notification URL
> both the JOB_ID and JOB_STATUS.
>
> ...
> Configuration conf = this.getConf();
> // Set the callback parameters
> conf.set("job.end.notification.url",
>   "https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus");
> ...
> // Submit your job in background
> job.submit();
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value:
>
> https://<host>/api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED
>
> Reference:
> https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
>
> Thanks.



--
Marcelo


Re: Can i get callback notification on Spark job completion ?

2016-10-28 Thread Marcelo Vanzin
If you look at the "startApplication" method it takes listeners as parameters.
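
For reference, a minimal sketch of that listener-based launch (SparkLauncher
API from Spark 1.6+; the jar path, main class and master are placeholders):

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

// Launch the application and get called back on every state transition,
// instead of polling handle.getState() in a loop.
val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")   // placeholder
  .setMainClass("com.example.Main")     // placeholder
  .setMaster("yarn")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      if (h.getState.isFinal) println(s"app ${h.getAppId} ended as ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })
```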

On Fri, Oct 28, 2016 at 10:23 AM, Elkhan Dadashov  wrote:
> Hi,
>
> I know that we can use SparkAppHandle (introduced in SparkLauncher version
>>=1.6), and let the delegator map task stay alive until the Spark job
> finishes. But i wonder, if this can be done via callback notification
> instead of polling.
>
> Can i get callback notification on Spark job completion ?
>
> Similar to Hadoop, get a callback on MapReduce job completion - getting a
> notification instead of polling.
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value. Can be retrieved from notification URL
> both the JOB_ID and JOB_STATUS.
>
> ...
> Configuration conf = this.getConf();
> // Set the callback parameters
> conf.set("job.end.notification.url",
> "https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus;);
> ...
> // Submit your job in background
> job.submit();
>
> At job completion, an HTTP request will be sent to
> “job.end.notification.url” value:
>
> https://<host>/api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED
>
> Reference:
> https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/
>
> Thanks.



-- 
Marcelo




Can i get callback notification on Spark job completion ?

2016-10-28 Thread Elkhan Dadashov
Hi,

I know that we can use SparkAppHandle (introduced in SparkLauncher version
>=1.6), and let the delegator map task stay alive until the Spark job
finishes. But I wonder if this can be done via callback notification
instead of polling.

Can I get a callback notification on Spark job completion?

Similar to Hadoop, where you get a callback on MapReduce job completion - a
notification instead of polling.

At job completion, an HTTP request will be sent to the
“job.end.notification.url” value. Both the JOB_ID and JOB_STATUS can be
retrieved from the notification URL.

...
Configuration conf = this.getConf();
// Set the callback parameters
conf.set("job.end.notification.url",
  "https://hadoopi.wordpress.com/api/hadoop/notification/$jobId?status=$jobStatus");
...
// Submit your job in background
job.submit();

At job completion, an HTTP request will be sent to
“job.end.notification.url” value:

https://<host>/api/hadoop/notification/job_1379509275868_0002?status=SUCCEEDED

Reference:
https://hadoopi.wordpress.com/2013/09/18/hadoop-get-a-callback-on-mapreduce-job-completion/


Thanks.


Re: Does the delegator map task of SparkLauncher need to stay alive until Spark job finishes ?

2016-10-28 Thread Elkhan Dadashov
I figured out that the job id returned from sparkAppHandle.getAppId() is a unique
ApplicationId, which looks like this:

For local mode Spark env: local-1477184581895
For distributed Spark mode: application_1477504900821_0005

ApplicationId represents the globally unique identifier for an application.

The globally unique nature of the identifier is achieved by using the cluster
timestamp, i.e. the start time of the ResourceManager, along with a monotonically
increasing counter for the application.
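
As Marcelo noted earlier in the thread, once the launcher handle is gone this
appId can be used to ask the ResourceManager directly. A sketch of that,
assuming the Hadoop 2.x YARN client API is on the classpath (the appId string
is just the example above):

```scala
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.hadoop.yarn.util.ConverterUtils

// Look up the status of a previously launched application by the id
// returned from sparkAppHandle.getAppId().
val yarn = YarnClient.createYarnClient()
yarn.init(new YarnConfiguration())
yarn.start()
val appId = ConverterUtils.toApplicationId("application_1477504900821_0005")
val report = yarn.getApplicationReport(appId)
println(s"state = ${report.getYarnApplicationState}, final status = ${report.getFinalApplicationStatus}")
yarn.stop()
```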




On Sat, Oct 22, 2016 at 5:18 PM Elkhan Dadashov 
wrote:

> I found answer regarding logging in the JavaDoc of SparkLauncher:
>
> "Currently, all applications are launched as child processes. The child's
> stdout and stderr are merged and written to a logger (see
> java.util.logging)."
>
> One last question. sparkAppHandle.getAppId() - does this function
> return org.apache.hadoop.mapred.*JobID* which makes it easy tracking in
> Yarn ? Or is appId just the Spark app name we assign ?
>
> If it is JobID, then even if the SparkLauncher handle goes away, by
> talking directly to the cluster manager, i can get Job details.
>
> Thanks.
>
> On Sat, Oct 22, 2016 at 4:53 PM Elkhan Dadashov 
> wrote:
>
> Thanks, Marcelo.
>
> One more question regarding getting logs.
>
> In previous implementation of SparkLauncer we could read logs from :
>
> sparkLauncher.getInputStream()
> sparkLauncher.getErrorStream()
>
> What is the recommended way of getting logs and logging of Spark execution
> while using sparkLauncer#startApplication() ?
>
> Thanks.
>
> On Tue, Oct 18, 2016 at 3:07 PM Marcelo Vanzin 
> wrote:
>
> On Tue, Oct 18, 2016 at 3:01 PM, Elkhan Dadashov 
> wrote:
> > Does my map task need to wait until Spark job finishes ?
>
> No...
>
> > Or is there any way, my map task finishes after launching Spark job, and
> I
> > can still query and get status of Spark job outside of map task (or
> failure
> > reason, if it has failed) ? (maybe by querying Spark job id ?)
>
> ...but if the SparkLauncher handle goes away, then you lose the
> ability to track the app's state, unless you talk directly to the
> cluster manager.
>
> > I guess also if i want my Spark job to be killed, if corresponding
> delegator
> > map task is killed, that means my map task needs to stay alive, so i
> still
> > have SparkAppHandle reference ?
>
> Correct, unless you talk directly to the cluster manager.
>
> --
> Marcelo
>
>


Re: CSV escaping not working

2016-10-28 Thread Daniel Barclay

In any case, it seems that the current behavior is not documented sufficiently.

Koert Kuipers wrote:

i can see how unquoted csv would work if you escape delimiters, but i have 
never seen that in practice.

On Thu, Oct 27, 2016 at 2:03 PM, Jain, Nishit > wrote:

I’d think quoting is only necessary if you are not escaping delimiters in 
data. But we can only share our opinions. It would be good to see something 
documented.
This may be the cause of the issue?: 
https://issues.apache.org/jira/browse/CSV-135 


From: Koert Kuipers >
Date: Thursday, October 27, 2016 at 12:49 PM

To: "Jain, Nishit" >
Cc: "user@spark.apache.org " >
Subject: Re: CSV escaping not working

well my expectation would be that if you have delimiters in your data you 
need to quote your values. if you then have quotes within your data you need to 
escape them.

so escaping is only necessary if quoted.

On Thu, Oct 27, 2016 at 1:45 PM, Jain, Nishit > wrote:

Do you mind sharing why should escaping not work without quotes?

From: Koert Kuipers >
Date: Thursday, October 27, 2016 at 12:40 PM
To: "Jain, Nishit" >
Cc: "user@spark.apache.org " 
>
Subject: Re: CSV escaping not working

that is what i would expect: escaping only works if quoted

On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit > wrote:

Interesting finding: Escaping works if data is quoted but not 
otherwise.

From: "Jain, Nishit" >
Date: Thursday, October 27, 2016 at 10:54 AM
To: "user@spark.apache.org " 
>
Subject: CSV escaping not working

I am using spark-core version 2.0.1 with Scala 2.11. I have simple 
code to read a csv file which has \ escapes.

val myDA = spark.read
  .option("quote", null)
  .schema(mySchema)
  .csv(filePath)

As per the documentation, \ is the default escape for the csv reader. But it 
does not work: Spark is reading \ as part of my data. For example, the City column 
in the csv file is "north rocks\,au". I am expecting the city column to be read in 
code as "northrocks,au", but instead Spark reads it as "northrocks\" and moves "au" 
to the next column.

I have tried the following, but it did not work:

  * Explicitly defining the escape: .option("escape", "\\")
  * Changing the escape to | or : in the file and in the code
  * Using the spark-csv library

Is anyone facing the same issue? Am I missing something?

Thanks
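
For readers hitting the same thing, here is a sketch of the quote-plus-escape
combination Koert describes (option names as in Spark 2.x's built-in CSV
reader; the path is a placeholder):

```scala
// Values containing the delimiter are wrapped in quotes, and quote characters
// inside values are escaped; escaping is then honored by the reader.
val df = spark.read
  .option("quote", "\"")      // default quote character
  .option("escape", "\\")     // escape character used inside quoted values
  .schema(mySchema)
  .csv("/path/to/file.csv")   // placeholder path
```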










Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-28 Thread Chetan Khatri
Thank you, everyone. The original question was: "Every time I write to Parquet,
it shows on the Spark UI that the stages succeeded, but on the spark shell it
holds the context in wait mode for almost 10 mins; then it clears the broadcast
and accumulator shared variables."

I don't think stopping the context can resolve the current issue.

It takes more time to clear broadcasts, accumulators, etc.

Can we tune this with the Spark 1.6.1 MapR distribution?
On Oct 27, 2016 2:34 PM, "Mehrez Alachheb"  wrote:

> I think you should just shut down your SparkContext at the end.
> sc.stop()
>
> 2016-10-21 22:47 GMT+02:00 Chetan Khatri :
>
>> Hello Spark Users,
>>
>> I am writing around 10 GB of Processed Data to Parquet where having 1 TB
>> of HDD and 102 GB of RAM, 16 vCore machine on Google Cloud.
>>
>> Every time, i write to parquet. it shows on Spark UI that stages
>> succeeded but on spark shell it hold context on wait mode for almost 10
>> mins. then it clears broadcast, accumulator shared variables.
>>
>> Can we sped up this thing ?
>>
>> Thanks.
>>
>> --
>> Yours Aye,
>> Chetan Khatri.
>> M.+91 7 80574
>> Data Science Researcher
>> INDIA
>>
>> ​​Statement of Confidentiality
>> 
>> The contents of this e-mail message and any attachments are confidential
>> and are intended solely for addressee. The information may also be legally
>> privileged. This transmission is sent in trust, for the sole purpose of
>> delivery to the intended recipient. If you have received this transmission
>> in error, any use, reproduction or dissemination of this transmission is
>> strictly prohibited. If you are not the intended recipient, please
>> immediately notify the sender by reply e-mail or phone and delete this
>> message and its attachments, if any.​​
>>
>
>


Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Zoltán Zvara
Worked for me 2 weeks ago with a 3.0.0-alpha2 snapshot. Just changed
hadoop.version while building.

On Fri, Oct 28, 2016, 11:50 Sean Owen  wrote:

> I don't think it works, but, there is no Hadoop 3.0 right now either. As
> the version implies, it's going to be somewhat different API-wise.
>
> On Thu, Oct 27, 2016 at 11:04 PM adam kramer  wrote:
>
> Is the version of Spark built for Hadoop 2.7 and later only for 2.x
> releases?
>
> Is there any reason why Hadoop 3.0 is a non-starter for use with Spark
> 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which
> would resolve our driver dependency issues.
>
> Thanks,
> Adam
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Sean Owen
https://issues.apache.org/jira/browse/SPARK-17898

On Fri, Oct 28, 2016 at 11:56 AM Aseem Bansal  wrote:

> Hi
>
> We are trying to use some of our artifacts as dependencies while
> submitting spark jobs. To specify the remote artifactory URL we are using
> the following syntax
>
>
> https://USERNAME:passw...@artifactory.companyname.com/artifactory/COMPANYNAME-libs
>
> But the resolution fails. Although the URL which is in the logs for the
> artifact is accessible via a browser due to the username password being
> present
>
> So to use an enterprise artifactory with spark is there a special way to
> specify the username and password when passing the repositories String?
>


Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
To add to the above I have already checked the documentation, API and even
looked at the source code at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L882
but I could not find anything hence I am asking here.

On Fri, Oct 28, 2016 at 4:26 PM, Aseem Bansal  wrote:

> Hi
>
> We are trying to use some of our artifacts as dependencies while
> submitting spark jobs. To specify the remote artifactory URL we are using
> the following syntax
>
> https://USERNAME:passw...@artifactory.companyname.com/
> artifactory/COMPANYNAME-libs
>
> But the resolution fails. Although the URL which is in the logs for the
> artifact is accessible via a browser due to the username password being
> present
>
> So to use an enterprise artifactory with spark is there a special way to
> specify the username and password when passing the repositories String?
>


[SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
Hi

We are trying to use some of our artifacts as dependencies while submitting
spark jobs. To specify the remote artifactory URL we are using the
following syntax

https://USERNAME:passw...@artifactory.companyname.com/artifactory/COMPANYNAME-libs

But the resolution fails, even though the URL printed in the logs for the
artifact is accessible via a browser (because the username and password are
embedded in it).

So to use an enterprise artifactory with spark is there a special way to
specify the username and password when passing the repositories String?


Weekly aggregation

2016-10-28 Thread Oshadha Gunawardena
Hello all,

Please look into this question raised on Stack Overflow:

http://stackoverflow.com/questions/40302893/apache-spark-weekly-aggregation

Thanks.
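
For the archive, a minimal sketch of one common approach to weekly aggregation
(not taken from the linked post; it assumes a DataFrame `df` with a timestamp
column `ts` and a numeric column `value` - the names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, sum, weekofyear, year}

// Group rows by calendar year and week-of-year, then aggregate.
val weekly = df
  .groupBy(year(col("ts")).as("year"), weekofyear(col("ts")).as("week"))
  .agg(sum(col("value")).as("weekly_total"))
weekly.show()
```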


Re: Executor shutdown hook and initialization

2016-10-28 Thread Sean Owen
Have a look at this ancient JIRA for a lot more discussion about this:
https://issues.apache.org/jira/browse/SPARK-650 You have exactly the same
issue described by another user. For your context, your approach is sound.

You can set a shutdown hook using the normal Java Runtime API. You may not
even need it; if your only resource is some data in memory or a daemon
thread it will take care of itself.

You can also consider rearchitecting to avoid needing global state.

Per-partition resource management is easy. You just use mapPartitions, open
resources per partition at the start of the function, close them in a
finally block, and do your work on the iterator over data in between.
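
A sketch of that per-partition pattern (the connection type and helper names
are placeholders, not a real API):

```scala
// Open one connection per partition, process the records while it is open,
// and close it in a finally block. The .toVector forces the (lazy) iterator
// inside the try so the connection is still open during processing.
val transformed = rdd.mapPartitions { records =>
  val conn = createConnection()            // placeholder resource factory
  try {
    records.map(r => enrich(conn, r)).toVector.iterator
  } finally {
    conn.close()
  }
}
```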

On Thu, Oct 27, 2016 at 3:44 PM Walter rakoff 
wrote:

> Thanks for the info Sean.
>
> I'm initializing them in a singleton but Scala objects are evaluated
> lazily.
> So it gets initialized only when the first task is run(and makes use of
> the object).
> Plan is to start a background thread in the object that does periodic
> cache refresh too.
> I'm trying to see if this init can be done right when executor is created.
>
> Btw, this is for a Spark streaming app. So doing this per partition during
> each batch isn't ideal.
> I'd like to keep them(connect & cache) across batches.
>
> Finally, how do I setup the shutdown hook on an executor? Except for
> operations on RDD everything else is executed in the driver.
> All I can think of is something like this
> sc.makeRDD((1 until sc.defaultParallelism), sc.defaultParallelism)
>.foreachPartition(sys.ShutdownHookThread { Singleton.DoCleanup() } )
>
> Walt
>
>


Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
Bad idea: no caching, and it over-consumes the cluster...
Have a look at instantiating a custom thrift server on temp tables with the
fair scheduler to allow concurrent SQL requests. It's not a public API but
you can find some examples.
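
For reference, the fair-scheduling half of that suggestion looks roughly like
this (a sketch; the pool name is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable the FAIR scheduler so jobs submitted from different threads share
// the cluster instead of queueing FIFO behind each other.
val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

// Each user or thread can then submit its work into its own pool.
sc.setLocalProperty("spark.scheduler.pool", "adhoc")
```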

Le 28 oct. 2016 11:12 AM, "Mich Talebzadeh"  a
écrit :

> Hi,
>
> I think tempTable is private to the session that creates it. In Hive temp
> tables created by "CREATE TEMPORARY TABLE" are all private to the session.
> Spark is no different.
>
> The alternative may be everyone creates tempTable from the same DF?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 October 2016 at 10:03, Chanh Le  wrote:
>
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>>
>> It just reuse 1 Spark Context by not letting it stop when the application
>> had done. Should check: livy, spark-jobserver
>> FAIR https://spark.apache.org/docs/1.2.0/job-scheduling.html just how
>> you scheduler your job in the pool but FAIR help you run job in parallel vs
>> FIFO (default) 1 job at the time.
>>
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>>
>>
>> Store metadata in Hive may help but I am not sure about this.
>> I use Spark Thrift Server create table on that then let Zeppelin query
>> from that.
>>
>> Regards,
>> Chanh
>>
>>
>>
>>
>>
>> On Oct 27, 2016, at 9:01 PM, Victor Shafran 
>> wrote:
>>
>> Hi Vincent,
>> Can you elaborate on how to implement "shared sparkcontext and fair
>> scheduling" option?
>>
>> My approach was to use  sparkSession.getOrCreate() method and register
>> temp table in one application. However, I was not able to access this
>> tempTable in another application.
>> You help is highly appreciated
>> Victor
>>
>> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:
>>
>>> Hi Mich,
>>>
>>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>>> DataFrames among different applications and contexts. The data typically
>>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>>> om/blog/effective-spark-rdds-with-alluxio
>>>
>>> Also, Alluxio also has the concept of an "Under filesystem", which can
>>> help you access your existing data across different storage systems. Here
>>> is more information about the unified namespace abilities:
>>> http://www.alluxio.org/docs/master/en/Unified-and
>>> -Transparent-Namespace.html
>>>
>>> Hope that helps,
>>> Gene
>>>
>>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks Chanh,

 Can it share RDDs.

 Personally I have not used either Alluxio or Ignite.


1. Are there major differences between these two
2. Have you tried Alluxio for sharing Spark RDDs and if so do you
have any experience you can kindly share

 Regards


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 27 October 2016 at 11:29, Chanh Le  wrote:

> Hi Mich,
> Alluxio is the good option to go.
>
> Regards,
> Chanh
>
> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> There was a mention of using Zeppelin to share RDDs with many users.
> From the notes on Zeppelin it appears that this is sharing UI and I am not
> sure how easy it is going to be changing the result set with different
> users modifying say sql queries.
>
> There is also the idea of 

Re: Spark 2.0 with Hadoop 3.0?

2016-10-28 Thread Sean Owen
I don't think it works, but, there is no Hadoop 3.0 right now either. As
the version implies, it's going to be somewhat different API-wise.

On Thu, Oct 27, 2016 at 11:04 PM adam kramer  wrote:

> Is the version of Spark built for Hadoop 2.7 and later only for 2.x
> releases?
>
> Is there any reason why Hadoop 3.0 is a non-starter for use with Spark
> 2.0? The version of aws-sdk in 3.0 actually works for DynamoDB which
> would resolve our driver dependency issues.
>
> Thanks,
> Adam
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: LIMIT issue of SparkSQL

2016-10-28 Thread Liz Bai
Sorry for the late reply.
The size of the raw data is 20G and it is composed of two columns. We generated 
it by this [link].
The test queries are very simple:
1) select ColA from Table limit 1
2) select ColA from Table
3) select ColA from Table where ColB=0
4) select ColA from Table where ColB=0 limit 1
We found that if we use `result.collect()`, it does stop early upon getting 
adequate results for queries 1) and 4).
However, we used to run `result.write.parquet`, and there is no early stop; it 
scans much more data than `result.collect()`.

Below is the detailed testing summary:
Query                                         Method of Saving Results   Run Time
select ColA from Table limit 1                result.write.Parquet       1m 56s
select ColA from Table                        result.write.Parquet       1m 40s
select ColA from Table where ColB=0 limit 1   result.write.Parquet       1m 32s
select ColA from Table where ColB=0           result.write.Parquet       1m 21s
select ColA from Table limit 1                result.collect()           18s
select ColA from Table where ColB=0 limit 1   result.collect()           18s
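
For anyone reproducing this, comparing the physical plans of the two
consumption paths shows where the difference comes from (the table name is the
one used above; the output path is a placeholder):

```scala
// explain(true) prints the parsed, analyzed, optimized and physical plans;
// the collect() path is the one that can stop early, which matches the
// timings in the table above.
val result = spark.sql("SELECT ColA FROM Table WHERE ColB = 0 LIMIT 1")
result.explain(true)
result.write.parquet("/tmp/limit-test")   // placeholder output path
```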

Thanks.

Best,
Liz
> On 27 Oct 2016, at 2:16 AM, Michael Armbrust  wrote:
> 
> That is surprising then, you may have found a bug.  What timings are you 
> seeing?  Can you reproduce it with data you can share? I'd open a JIRA if so.
> 
> On Tue, Oct 25, 2016 at 4:32 AM, Liz Bai  > wrote:
> We used Parquet as data source. The query is like “select ColA from table 
> limit 1”. Attached is the query plan of it. (However its run time is just the 
> same as “select ColA from table”.)
> We expected an early stop upon getting 1 result, rather than scanning all 
> records and finally collect it with limit in the final phase. 
> Btw, I agree with Mich’s concerning. `Limit push down` is impossible when 
> involving table joins. But some cases such as “Filter + Projection + Limit”  
> will benefit from `limit push down`.
> May I know if there is any detailed solutions for this?
> 
> Thanks so much.
> 
> Best,
> Liz
> 
> 
>> On 25 Oct 2016, at 5:54 AM, Michael Armbrust > > wrote:
>> 
>> It is not about limits on specific tables.  We do support that.  The case 
>> I'm describing involves pushing limits across system boundaries.  It is 
>> certainly possible to do this, but the current datasource API does provide 
>> this information (other than the implicit limit that is pushed down to the 
>> consumed iterator of the data source).
>> 
>> On Mon, Oct 24, 2016 at 9:11 AM, Mich Talebzadeh > > wrote:
>> This is an interesting point.
>> 
>> As far as I know in any database (practically all RDBMS Oracle, SAP etc), 
>> the LIMIT affects the collection part of the result set.
>> 
>> The result set is carried out fully on the query that may involve multiple 
>> joins on multiple underlying tables.
>> 
>> To limit the actual query by LIMIT on each underlying table does not make 
>> sense and will not be industry standard AFAK.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 24 October 2016 at 06:48, Michael Armbrust > > wrote:
>> - dev + user
>> 
>> Can you give more info about the query?  Maybe a full explain()?  Are you 
>> using a datasource like JDBC?  The API does not currently push down limits, 
>> but the documentation talks about how you can use a query instead of a table 
>> if that is what you are looking to do.
>> 
>> On Mon, Oct 24, 2016 at 5:40 AM, Liz Bai > > wrote:
>> Hi all,
>> 
>> Let me clarify the problem: 
>> 
>> Suppose we have a simple table `A` with 100 000 000 records
>> 
>> Problem:
>> When we execute sql query ‘select * from A Limit 500`,
>> It scan through all 100 000 000 records. 
>> Normal behaviour should be that once 500 records is found, engine stop 
>> scanning.
>> 
>> Detailed observation:
>> We found that there are “GlobalLimit / LocalLimit” physical operators
>> https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala
>>  
>> 
>> But during query plan generation, GlobalLimit / LocalLimit is not applied to 
>> the query plan.
>> 
>> Could 

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Hi,

I think tempTable is private to the session that creates it. In Hive temp
tables created by "CREATE TEMPORARY TABLE" are all private to the session.
Spark is no different.

The alternative may be everyone creates tempTable from the same DF?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 28 October 2016 at 10:03, Chanh Le  wrote:

> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
>
> It just reuse 1 Spark Context by not letting it stop when the application
> had done. Should check: livy, spark-jobserver
> FAIR https://spark.apache.org/docs/1.2.0/job-scheduling.html just how you
> scheduler your job in the pool but FAIR help you run job in parallel vs
> FIFO (default) 1 job at the time.
>
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
>
>
> Store metadata in Hive may help but I am not sure about this.
> I use Spark Thrift Server create table on that then let Zeppelin query
> from that.
>
> Regards,
> Chanh
>
>
>
>
>
> On Oct 27, 2016, at 9:01 PM, Victor Shafran 
> wrote:
>
> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair
> scheduling" option?
>
> My approach was to use  sparkSession.getOrCreate() method and register
> temp table in one application. However, I was not able to access this
> tempTable in another application.
> You help is highly appreciated
> Victor
>
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  wrote:
>
>> Hi Mich,
>>
>> Yes, Alluxio is commonly used to cache and share Spark RDDs and
>> DataFrames among different applications and contexts. The data typically
>> stays in memory, but with Alluxio's tiered storage, the "colder" data can
>> be evicted out to other medium, like SSDs and HDDs. Here is a blog post
>> discussing Spark RDDs and Alluxio: https://www.alluxio.c
>> om/blog/effective-spark-rdds-with-alluxio
>>
>> Also, Alluxio also has the concept of an "Under filesystem", which can
>> help you access your existing data across different storage systems. Here
>> is more information about the unified namespace abilities:
>> http://www.alluxio.org/docs/master/en/Unified-and
>> -Transparent-Namespace.html
>>
>> Hope that helps,
>> Gene
>>
>> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Chanh,
>>>
>>> Can it share RDDs.
>>>
>>> Personally I have not used either Alluxio or Ignite.
>>>
>>>
>>>1. Are there major differences between these two
>>>2. Have you tried Alluxio for sharing Spark RDDs and if so do you
>>>have any experience you can kindly share
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 27 October 2016 at 11:29, Chanh Le  wrote:
>>>
 Hi Mich,
 Alluxio is the good option to go.

 Regards,
 Chanh

 On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh 
 wrote:


 There was a mention of using Zeppelin to share RDDs with many users.
 From the notes on Zeppelin it appears that this is sharing UI and I am not
 sure how easy it is going to be changing the result set with different
 users modifying say sql queries.

 There is also the idea of caching RDDs with something like Apache
 Ignite. Has anyone really tried this. Will that work with multiple
 applications?

 It looks feasible as RDDs are immutable and so are registered
 tempTables etc.

 Thanks


 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *

Re: Sharing RDDS across applications and users

2016-10-28 Thread Chanh Le
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 


It just reuses one SparkContext by not letting it stop when the application is 
done. Worth checking: Livy, spark-jobserver.
FAIR (https://spark.apache.org/docs/1.2.0/job-scheduling.html) is just how you 
schedule your jobs in pools, but FAIR helps you run jobs in parallel, versus FIFO 
(the default), which runs one job at a time.


> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 


Storing metadata in Hive may help, but I am not sure about this.
I use the Spark Thrift Server to create tables, then let Zeppelin query from 
that.

Regards,
Chanh





> On Oct 27, 2016, at 9:01 PM, Victor Shafran  wrote:
> 
> Hi Vincent,
> Can you elaborate on how to implement "shared sparkcontext and fair 
> scheduling" option? 
> 
> My approach was to use  sparkSession.getOrCreate() method and register temp 
> table in one application. However, I was not able to access this tempTable in 
> another application. 
> You help is highly appreciated 
> Victor
> 
> On Thu, Oct 27, 2016 at 4:31 PM, Gene Pang  > wrote:
> Hi Mich,
> 
> Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames 
> among different applications and contexts. The data typically stays in 
> memory, but with Alluxio's tiered storage, the "colder" data can be evicted 
> out to other medium, like SSDs and HDDs. Here is a blog post discussing Spark 
> RDDs and Alluxio: 
> https://www.alluxio.com/blog/effective-spark-rdds-with-alluxio 
> 
> 
> Also, Alluxio also has the concept of an "Under filesystem", which can help 
> you access your existing data across different storage systems. Here is more 
> information about the unified namespace abilities: 
> http://www.alluxio.org/docs/master/en/Unified-and-Transparent-Namespace.html 
> 
> 
> Hope that helps,
> Gene
> 
> On Thu, Oct 27, 2016 at 3:39 AM, Mich Talebzadeh  > wrote:
> Thanks Chanh,
> 
> Can it share RDDs.
> 
> Personally I have not used either Alluxio or Ignite.
> 
> Are there major differences between these two
> Have you tried Alluxio for sharing Spark RDDs and if so do you have any 
> experience you can kindly share
> Regards
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 27 October 2016 at 11:29, Chanh Le  > wrote:
> Hi Mich,
> Alluxio is the good option to go. 
> 
> Regards,
> Chanh
> 
>> On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh > > wrote:
>> 
>> 
>> There was a mention of using Zeppelin to share RDDs with many users. From 
>> the notes on Zeppelin it appears that this is sharing UI and I am not sure 
>> how easy it is going to be changing the result set with different users 
>> modifying say sql queries.
>> 
>> There is also the idea of caching RDDs with something like Apache Ignite. 
>> Has anyone really tried this. Will that work with multiple applications?
>> 
>> It looks feasible as RDDs are immutable and so are registered tempTables etc.
>> 
>> Thanks
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
> 
> 
> 
> 
> 
> 
> -- 
> Victor Shafran
> 
> VP R| Equalum
> 
> 
> Mobile: +972-523854883  | Email: 
> victor.shaf...@equalum.io 


Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Thanks all for your advice.

As I understand in layman's term if I had two applications running
successfully where app 2 was dependent on app 1 I would finish app 1, store
the results in HDFS and the app 2 starts reading the results from HDFS and
work on it.

Using  Alluxio or others replaces HDFS with in-memory storage where app 2
can pick up app1 results from memory or even SSD and do the work.

Actually I am surprised why Spark has not incorporated this type of memory
as temporary storage.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 27 October 2016 at 11:28, Mich Talebzadeh 
wrote:

>
> There was a mention of using Zeppelin to share RDDs with many users. From
> the notes on Zeppelin it appears that this is sharing UI and I am not sure
> how easy it is going to be changing the result set with different users
> modifying say sql queries.
>
> There is also the idea of caching RDDs with something like Apache Ignite.
> Has anyone really tried this. Will that work with multiple applications?
>
> It looks feasible as RDDs are immutable and so are registered tempTables
> etc.
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


convert spark dataframe to numpy (ndarray)

2016-10-28 Thread Zakaria Hili
Hi,

Is there any way to convert a spark dataframe into numpy ndarray without
using toPandas operation ?

Example:

C1   C2   C3    C4
0.7  3.0  1000  10954
0.9  4.2  1200  12345

I want to get this output:

[(0.7, 3.0, 1000L, 10954),(0.9, 4.2, 1200L, 12345)],
dtype=[('C1', '