Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Saisai Shao
I think you cannot use the SQL client in cluster mode. The same applies to
spark-shell/pyspark, which have a REPL; all of these applications can only be
started in client deploy mode.

On Thu, Jun 30, 2016 at 12:46 PM, Mich Talebzadeh  wrote:

> Hi,
>
> When you use spark-shell, or for that matter spark-sql, you are starting
> spark-submit under the bonnet. These two shells were created to make it
> easier to work with Spark.
>
>
> However, if you look at what $SPARK_HOME/bin/spark-sql does in the
> script, you will notice my point:
>
>
>
> exec "${SPARK_HOME}"/bin/spark-submit --class
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
>
>  So that is basically the spark-submit JVM.
>
>
>
> Since it uses spark-submit, it accepts all the parameters that spark-submit
> supports. You can test this with your own customised shell script, passing
> the parameters yourself.
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --driver-memory xG \
> --num-executors n \
> --executor-memory xG \
> --executor-cores m \
> --master yarn \
> --deploy-mode cluster \
>
> --class
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"
>
>
> Does your version of spark support cluster mode?
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 30 June 2016 at 05:16, Huang Meilong  wrote:
>
>> Hello,
>>
>>
>> I added the deploy-mode flag to the spark-sql CLI like this:
>>
>> $ spark-sql --deploy-mode cluster --master yarn -e "select * from mx"
>>
>>
>> It showed an error saying "Cluster deploy mode is not applicable to Spark
>> SQL shell", but "spark-sql --help" shows the "--deploy-mode" option. Is this
>> a bug?
>>
>
>


Re: deploy-mode flag in spark-sql cli

2016-06-29 Thread Mich Talebzadeh
Hi,

When you use spark-shell, or for that matter spark-sql, you are starting
spark-submit under the bonnet. These two shells were created to make it
easier to work with Spark.


However, if you look at what $SPARK_HOME/bin/spark-sql does in the
script, you will notice my point:



exec "${SPARK_HOME}"/bin/spark-submit --class
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"

So that is basically the spark-submit JVM.



Since it uses spark-submit, it accepts all the parameters that spark-submit
supports. You can test this with your own customised shell script, passing
the parameters yourself.


${SPARK_HOME}/bin/spark-submit \
--driver-memory xG \
--num-executors n \
--executor-memory xG \
--executor-cores m \
--master yarn \
--deploy-mode cluster \

--class
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver "$@"


Does your version of spark support cluster mode?


HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 30 June 2016 at 05:16, Huang Meilong  wrote:

> Hello,
>
>
> I added the deploy-mode flag to the spark-sql CLI like this:
>
> $ spark-sql --deploy-mode cluster --master yarn -e "select * from mx"
>
>
> It showed an error saying "Cluster deploy mode is not applicable to Spark
> SQL shell", but "spark-sql --help" shows the "--deploy-mode" option. Is this
> a bug?
>


Re: Error report file is deleted automatically after spark application finished

2016-06-29 Thread dhruve ashar
You can look at the yarn-default configuration file.

Check your log-related settings to see whether log aggregation is enabled, and
also check the log retention duration to see whether it is too small and files
are being deleted.
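
For reference, a minimal yarn-site.xml sketch of the settings usually involved;
the values are examples only. In particular, yarn.nodemanager.delete.debug-delay-sec
keeps the localized container directories (where files such as hs_err_pid*.log end
up) on disk for the given number of seconds after an application finishes, which is
often the simplest way to get at such files:

<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>     <!-- keep container working dirs for 10 minutes after the app ends -->
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>    <!-- aggregate container logs (stdout/stderr) to HDFS -->
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>  <!-- keep aggregated logs for 7 days -->
</property>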

On Wed, Jun 29, 2016 at 4:47 PM, prateek arora 
wrote:

>
> Hi
>
> My Spark application was crashed and show information
>
> LogType:stdout
> Log Upload Time:Wed Jun 29 14:38:03 -0700 2016
> LogLength:1096
> Log Contents:
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGILL (0x4) at pc=0x7f67baa0d221, pid=12207, tid=140083473176320
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
> 1.7.0_67-b01)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [libcaffe.so.1.0.0-rc3+0x786221]  sgemm_kernel+0x21
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> #
>
> /yarn/nm/usercache/ubuntu/appcache/application_1467236060045_0001/container_1467236060045_0001_01_03/hs_err_pid12207.log
>
>
>
> but I am not able to find the
>
> "/yarn/nm/usercache/ubuntu/appcache/application_1467236060045_0001/container_1467236060045_0001_01_03/hs_err_pid12207.log"
> file. It is deleted automatically after the Spark application finishes.
>
>
> How do I retain the report file? I am running Spark with YARN.
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Error-report-file-is-deleted-automatically-after-spark-application-finished-tp27247.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
-Dhruve Ashar


deploy-mode flag in spark-sql cli

2016-06-29 Thread Huang Meilong
Hello,


I added the deploy-mode flag to the spark-sql CLI like this:


$ spark-sql --deploy-mode cluster --master yarn -e "select * from mx"


It showed an error saying "Cluster deploy mode is not applicable to Spark SQL
shell", but "spark-sql --help" shows the "--deploy-mode" option. Is this a bug?


Regarding Decision Tree

2016-06-29 Thread Chintan Bhatt
Hello,
I want to improve a decision tree model in Spark.
Can anyone help me with parameter tuning for such an improvement?
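
For reference, a minimal Scala sketch of one common way to tune a decision tree in
Spark ML, using a parameter grid and cross-validation; the DataFrame name `training`
and its "label"/"features" columns are assumptions, as are the candidate values.

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Assumes a DataFrame `training` with "label" and "features" columns.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// Candidate values to search over; adjust to your data.
val paramGrid = new ParamGridBuilder()
  .addGrid(dt.maxDepth, Array(3, 5, 10))
  .addGrid(dt.maxBins, Array(16, 32, 64))
  .addGrid(dt.minInstancesPerNode, Array(1, 5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(dt)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("f1"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val model = cv.fit(training)   // model.bestModel holds the best-performing tree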


-- 
CHINTAN BHATT 
Assistant Professor,
U & P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Charotar University of Science And Technology (CHARUSAT),
Changa-388421, Gujarat, INDIA.
http://www.charusat.ac.in
*Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/

-- 


DISCLAIMER: The information transmitted is intended only for the person or 
entity to which it is addressed and may contain confidential and/or 
privileged material which is the intellectual property of Charotar 
University of Science & Technology (CHARUSAT). Any review, retransmission, 
dissemination or other use of, or taking of any action in reliance upon 
this information by persons or entities other than the intended recipient 
is strictly prohibited. If you are not the intended recipient, or the 
employee, or agent responsible for delivering the message to the intended 
recipient and/or if you have received this in error, please contact the 
sender and delete the material from the computer or device. CHARUSAT does 
not take any liability or responsibility for any malicious codes/software 
and/or viruses/Trojan horses that may have been picked up during the 
transmission of this message. By opening and solely relying on the contents 
or part thereof this message, and taking action thereof, the recipient 
relieves the CHARUSAT of all the liabilities including any damages done to 
the recipient's pc/laptop/peripherals and other communication devices due 
to any reason.


Re: Using R code as part of a Spark Application

2016-06-29 Thread Sun Rui
Hi, Gilad,

You can try the dapply() and gapply() functions in SparkR in Spark 2.0. Yes, it
is required that R be installed on each worker node.

However, if your Spark application is Scala/Java based, running R code on
DataFrames is not supported for now. There is a closed JIRA,
https://issues.apache.org/jira/browse/SPARK-14746, which remains open for
discussion purposes. You have to convert DataFrames to RDDs, and use pipe() on
RDDs to launch external R processes and run the R code.
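
As a hedged illustration of that last point, here is a minimal Scala sketch of the
pipe() approach; the script name my_script.R and the CSV line format are assumptions,
and both R and the script (shipped, for example, with --files) must be present on
every worker node.

import org.apache.spark.{SparkConf, SparkContext}

object PipeRExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-r-example"))

    // Rows rendered as CSV lines; in practice these would come from df.rdd.map(...).
    val lines = sc.parallelize(Seq("1,2.0", "2,3.5", "3,7.1"))

    // Each partition is streamed to the stdin of one external Rscript process;
    // whatever that process writes to stdout becomes the resulting RDD[String].
    val results = lines.pipe(Seq("Rscript", "my_script.R"))

    results.collect().foreach(println)
    sc.stop()
  }
}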

> On Jun 30, 2016, at 07:08, Xinh Huynh  wrote:
> 
> It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0: 
> https://issues.apache.org/jira/browse/SPARK-6817 
> 
> 
> Here's some of the code:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala
>  
> 
> 
> /**
>  * A function wrapper that applies the given R function to each partition.
>  */
> private[sql] case class MapPartitionsRWrapper(
> func: Array[Byte],
> packageNames: Array[Byte],
> broadcastVars: Array[Broadcast[Object]],
> inputSchema: StructType,
> outputSchema: StructType) extends (Iterator[Any] => Iterator[Any]) 
> 
> Xinh
> 
> On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen  > wrote:
> Here we (or certainly I) am not talking about R Server, but plain vanilla R, 
> as used with Spark and SparkR. Currently, SparkR doesn't distribute R code at 
> all (it used to, sort of), so I'm wondering if that is changing back.
> 
> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne  > wrote:
> I don't think R server requires R on the executor nodes. I originally set up 
> a SparkR cluster for our Data Scientist on Azure which required that I 
> install R on each node, but for the R Server set up, there is an extra edge 
> node with R server that they connect to. From what little research I was able 
> to do, it seems that there are some special functions in R Server that can 
> distribute the work to the cluster. 
> 
> Documentation is light, and hard to find but I found this helpful:
> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>  
> 
> 
> 
> 
> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen  > wrote:
> Oh, interesting: does this really mean the return of distributing R
> code from driver to executors and running it remotely, or do I
> misunderstand? this would require having R on the executor nodes like
> it used to?
> 
> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh  > wrote:
> > There is some new SparkR functionality coming in Spark 2.0, such as
> > "dapply". You could use SparkR to load a Parquet file and then run "dapply"
> > to apply a function to each partition of a DataFrame.
> >
> > Info about loading Parquet file:
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
> >  
> > 
> >
> > API doc for "dapply":
> > http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
> >  
> > 
> >
> > Xinh
> >
> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog  > > wrote:
> >>
> >> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
> >> stuff you want to do on the Rscript stdin,  p
> >>
> >>
> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau  >> >
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>>
> >>>
> >>> I want to use R code as part of spark application (the same way I would
> >>> do with Scala/Python).  I want to be able to run an R syntax as a map
> >>> function on a big Spark dataframe loaded from a parquet file.
> >>>
> >>> Is this even possible or the only way to use R is as part of RStudio
> >>> orchestration of our Spark  cluster?
> >>>
> >>>
> >>>
> >>> Thanks for the help!
> >>>
> >>>
> >>>
> >>> Gilad
> >>>
> >>>
> >>
> >>
> >
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 
> 
> 
> 
> -- 
> John Aherne
> Big Data and SQL Developer
> 
> 
> Cell:
> Email:
> Skype:
> Web:
> 
> +1 (303) 809-9718 

Re: Using R code as part of a Spark Application

2016-06-29 Thread Xinh Huynh
It looks like it. "DataFrame UDFs in R" is resolved in Spark 2.0:
https://issues.apache.org/jira/browse/SPARK-6817

Here's some of the code:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/r/MapPartitionsRWrapper.scala

/**
* A function wrapper that applies the given R function to each partition.
*/
private[sql] case class MapPartitionsRWrapper(
func: Array[Byte],
packageNames: Array[Byte],
broadcastVars: Array[Broadcast[Object]],
inputSchema: StructType,
outputSchema: StructType) extends (Iterator[Any] => Iterator[Any])

Xinh

On Wed, Jun 29, 2016 at 2:59 PM, Sean Owen  wrote:

> Here we (or certainly I) am not talking about R Server, but plain vanilla
> R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R
> code at all (it used to, sort of), so I'm wondering if that is changing
> back.
>
> On Wed, Jun 29, 2016 at 10:53 PM, John Aherne 
> wrote:
>
>> I don't think R server requires R on the executor nodes. I originally set
>> up a SparkR cluster for our Data Scientist on Azure which required that I
>> install R on each node, but for the R Server set up, there is an extra edge
>> node with R server that they connect to. From what little research I was
>> able to do, it seems that there are some special functions in R Server that
>> can distribute the work to the cluster.
>>
>> Documentation is light, and hard to find but I found this helpful:
>>
>> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>>
>>
>>
>> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen  wrote:
>>
>>> Oh, interesting: does this really mean the return of distributing R
>>> code from driver to executors and running it remotely, or do I
>>> misunderstand? this would require having R on the executor nodes like
>>> it used to?
>>>
>>> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh 
>>> wrote:
>>> > There is some new SparkR functionality coming in Spark 2.0, such as
>>> > "dapply". You could use SparkR to load a Parquet file and then run
>>> "dapply"
>>> > to apply a function to each partition of a DataFrame.
>>> >
>>> > Info about loading Parquet file:
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>>> >
>>> > API doc for "dapply":
>>> >
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>>> >
>>> > Xinh
>>> >
>>> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog 
>>> wrote:
>>> >>
>>> >> try Spark pipeRDD's , you can invoke the R script from pipe , push
>>> the
>>> >> stuff you want to do on the Rscript stdin,  p
>>> >>
>>> >>
>>> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <
>>> gilad.lan...@clicktale.com>
>>> >> wrote:
>>> >>>
>>> >>> Hello,
>>> >>>
>>> >>>
>>> >>>
>>> >>> I want to use R code as part of spark application (the same way I
>>> would
>>> >>> do with Scala/Python).  I want to be able to run an R syntax as a map
>>> >>> function on a big Spark dataframe loaded from a parquet file.
>>> >>>
>>> >>> Is this even possible or the only way to use R is as part of RStudio
>>> >>> orchestration of our Spark  cluster?
>>> >>>
>>> >>>
>>> >>>
>>> >>> Thanks for the help!
>>> >>>
>>> >>>
>>> >>>
>>> >>> Gilad
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> John Aherne
>> Big Data and SQL Developer
>>
>> [image: JustEnough Logo]
>>
>> Cell:
>> Email:
>> Skype:
>> Web:
>>
>> +1 (303) 809-9718
>> john.ahe...@justenough.com
>> john.aherne.je
>> www.justenough.com
>>
>>
>> Confidentiality Note: The information contained in this email and 
>> document(s) attached are for the exclusive use of the addressee and may 
>> contain confidential, privileged and non-disclosable information. If the 
>> recipient of this email is not the addressee, such recipient is strictly 
>> prohibited from reading, photocopying, distribution or otherwise using this 
>> email or its contents in any way.
>>
>>
>


Re: Aggregator (Spark 2.0) skips aggregation if zero() returns null

2016-06-29 Thread Koert Kuipers
It's the difference between a semigroup and a monoid, and yes, max does not
easily fit into a monoid.

see also discussion here:
https://issues.apache.org/jira/browse/SPARK-15598
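
To make the semigroup-versus-monoid point concrete, here is a hedged Scala sketch
(not from the thread) of the usual lifting trick: wrap the buffer in Option so that
None, rather than null or a sentinel value, is the zero, which turns max into a
proper monoid. It assumes the Spark 2.0 Aggregator API and an input of
(String, Long) pairs.

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Buffer type is Option[Long]: None is the identity element, so zero() never needs null.
object MaxAgg extends Aggregator[(String, Long), Option[Long], Long] {
  def zero: Option[Long] = None

  def reduce(buf: Option[Long], in: (String, Long)): Option[Long] =
    Some(buf.map(b => math.max(b, in._2)).getOrElse(in._2))

  def merge(b1: Option[Long], b2: Option[Long]): Option[Long] = (b1, b2) match {
    case (Some(x), Some(y)) => Some(math.max(x, y))
    case (x, y)             => x.orElse(y)
  }

  // Only reached when there was no input at all; pick a sentinel, or make OUT an Option.
  def finish(buf: Option[Long]): Long = buf.getOrElse(Long.MinValue)

  def bufferEncoder: Encoder[Option[Long]] = Encoders.kryo[Option[Long]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

Used as ds.groupByKey(_._1).agg(MaxAgg.toColumn), this keeps null out of the buffer
entirely.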

On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela  wrote:

> OK. I see that, but the current (provided) implementations are very naive
> - Sum, Count, Average -let's take Max for example: I guess zero() would be
> set to some value like Long.MIN_VALUE, but what if you trigger (I assume in
> the future Spark streaming will support time-based triggers) for a result
> and there are no events ?
>
> And like I said, for a more general use case: What if my zero() function
> depends on my input ?
>
> I just don't see the benefit of this behaviour, though I realise this is
> the implementation.
>
> Thanks,
> Amit
>
> On Sun, Jun 26, 2016 at 2:09 PM Takeshi Yamamuro 
> wrote:
>
>> No, TypedAggregateExpression that uses Aggregator#zero is different
>> between v2.0 and v1.6.
>> v2.0:
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L91
>> v1.6:
>> https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TypedAggregateExpression.scala#L115
>>
>> // maropu
>>
>>
>> On Sun, Jun 26, 2016 at 8:03 PM, Amit Sela  wrote:
>>
>>> This "if (value == null)" condition you point to exists in 1.6 branch as
>>> well, so that's probably not the reason.
>>>
>>> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Whatever it is, this is expected; if an initial value is null, spark
 codegen removes all the aggregates.
 See:
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala#L199

 // maropu

 On Sun, Jun 26, 2016 at 7:46 PM, Amit Sela 
 wrote:

> Not sure about what's the rule in case of `b + null = null` but the
> same code works perfectly in 1.6.1, just tried it..
>
> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
>
>> Hi,
>>
>> This behaviour seems to be expected because you must ensure `b +
>> zero() = b`.
>> In your case `b + null = null` breaks this rule.
>> This is the same with v1.6.1.
>> See:
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L57
>>
>> // maropu
>>
>>
>> On Sun, Jun 26, 2016 at 6:06 PM, Amit Sela 
>> wrote:
>>
>>> Sometimes, the BUF for the aggregator may depend on the actual
>>> input.. and while this passes the responsibility to handle null in
>>> merge/reduce to the developer, it sounds fine to me if he is the one who
>>> put null in zero() anyway.
>>> Now, it seems that the aggregation is skipped entirely when zero()
>>> = null. Not sure if that was the behaviour in 1.6
>>>
>>> Is this behaviour wanted ?
>>>
>>> Thanks,
>>> Amit
>>>
>>> Aggregator example:
>>>
>>> public static class Agg extends Aggregator<Tuple2<String, Integer>,
>>> Integer, Integer> {
>>>
>>>   @Override
>>>   public Integer zero() {
>>> return null;
>>>   }
>>>
>>>   @Override
>>>   public Integer reduce(Integer b, Tuple2<String, Integer> a) {
>>> if (b == null) {
>>>   b = 0;
>>> }
>>> return b + a._2();
>>>   }
>>>
>>>   @Override
>>>   public Integer merge(Integer b1, Integer b2) {
>>> if (b1 == null) {
>>>   return b2;
>>> } else if (b2 == null) {
>>>   return b1;
>>> } else {
>>>   return b1 + b2;
>>> }
>>>   }
>>>
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


 --
 ---
 Takeshi Yamamuro

>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Mich Talebzadeh
Hi,

Well, I would not assume anything myself. If you want an ordering, do it
explicitly.

Let us take a simple case by creating three DFs based on existing tables

val s =
HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")

now let us join these tables

val rs =
s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))

And do an ordering explicitly:

val rs1 = rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 June 2016 at 14:32, Jestin Ma  wrote:

> If it’s not too much trouble, could I get some pointers/help on this? (see
> link)
>
> http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order
>
> -also, as a side question, do Dataframes support easy reordering of
> columns?
>
> Thank you!
> Jestin
>


Re: Possible to broadcast a function?

2016-06-29 Thread Bin Fan
Following this suggestion, Aaron, you may take a look at Alluxio as an
off-heap in-memory data store used as input/output for Spark jobs, if that
works for you.

See this introduction to running Spark with Alluxio as data input/output:

http://www.alluxio.org/documentation/en/Running-Spark-on-Alluxio.html

- Bin
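
Separately, the "static object in the JVM" approach described later in the thread is
usually written as a lazily initialized singleton referenced from inside the tasks,
so the load happens at most once per executor JVM. A hedged Scala sketch, where
loadBigStructure() is purely a stand-in for the external ~250 GiB load:

import org.apache.spark.rdd.RDD

// One copy per executor JVM: the lazy val is initialized by the first task that
// touches it on that executor and reused by every later task.
object BigStructureHolder {
  lazy val data: Map[String, Long] = loadBigStructure()

  // Hypothetical loader standing in for the external data source.
  private def loadBigStructure(): Map[String, Long] = Map("example" -> 1L)
}

object PerExecutorSingletonExample {
  def enrich(rdd: RDD[String]): RDD[(String, Long)] =
    rdd.mapPartitions { iter =>
      val lookup = BigStructureHolder.data // triggers the load at most once per executor
      iter.map(key => key -> lookup.getOrElse(key, 0L))
    }
}

With one large executor per machine this becomes one copy per node.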

On Wed, Jun 29, 2016 at 8:40 AM, Sonal Goyal  wrote:

> Have you looked at Alluxio? (earlier tachyon)
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
> On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin 
> wrote:
>
>> The user guide describes a broadcast as a way to move a large dataset to
>> each node:
>>
>> "Broadcast variables allow the programmer to keep a read-only variable
>> cached on each machine rather than shipping a copy of it with tasks. They
>> can be used, for example, to give every node a copy of a large input
>> dataset in an efficient manner."
>>
>> And the broadcast example shows it being used with a variable.
>>
>> But, is it somehow possible to instead broadcast a function that can be
>> executed once, per node?
>>
>> My use case is the following:
>>
>> I have a large data structure that I currently create on each executor.
>> The way that I create it is a hack.  That is, when the RDD function is
>> executed on the executor, I block, load a bunch of data (~250 GiB) from an
>> external data source, create the data structure as a static object in the
>> JVM, and then resume execution.  This works, but it ends up costing me a
>> lot of extra memory (i.e. a few TiB when I have a lot of executors).
>>
>> What I'd like to do is use the broadcast mechanism to load the data
>> structure once, per node.  But, I can't serialize the data structure from
>> the driver.
>>
>> Any ideas?
>>
>> Thanks!
>>
>> Aaron
>>
>>
>


Re: Using R code as part of a Spark Application

2016-06-29 Thread Sean Owen
Here we (or certainly I) am not talking about R Server, but plain vanilla
R, as used with Spark and SparkR. Currently, SparkR doesn't distribute R
code at all (it used to, sort of), so I'm wondering if that is changing
back.

On Wed, Jun 29, 2016 at 10:53 PM, John Aherne 
wrote:

> I don't think R server requires R on the executor nodes. I originally set
> up a SparkR cluster for our Data Scientist on Azure which required that I
> install R on each node, but for the R Server set up, there is an extra edge
> node with R server that they connect to. From what little research I was
> able to do, it seems that there are some special functions in R Server that
> can distribute the work to the cluster.
>
> Documentation is light, and hard to find but I found this helpful:
>
> https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/
>
>
>
> On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen  wrote:
>
>> Oh, interesting: does this really mean the return of distributing R
>> code from driver to executors and running it remotely, or do I
>> misunderstand? this would require having R on the executor nodes like
>> it used to?
>>
>> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh  wrote:
>> > There is some new SparkR functionality coming in Spark 2.0, such as
>> > "dapply". You could use SparkR to load a Parquet file and then run
>> "dapply"
>> > to apply a function to each partition of a DataFrame.
>> >
>> > Info about loading Parquet file:
>> >
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>> >
>> > API doc for "dapply":
>> >
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>> >
>> > Xinh
>> >
>> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog 
>> wrote:
>> >>
>> >> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
>> >> stuff you want to do on the Rscript stdin,  p
>> >>
>> >>
>> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <
>> gilad.lan...@clicktale.com>
>> >> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>>
>> >>>
>> >>> I want to use R code as part of spark application (the same way I
>> would
>> >>> do with Scala/Python).  I want to be able to run an R syntax as a map
>> >>> function on a big Spark dataframe loaded from a parquet file.
>> >>>
>> >>> Is this even possible or the only way to use R is as part of RStudio
>> >>> orchestration of our Spark  cluster?
>> >>>
>> >>>
>> >>>
>> >>> Thanks for the help!
>> >>>
>> >>>
>> >>>
>> >>> Gilad
>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
>
> John Aherne
> Big Data and SQL Developer
>
> [image: JustEnough Logo]
>
> Cell:
> Email:
> Skype:
> Web:
>
> +1 (303) 809-9718
> john.ahe...@justenough.com
> john.aherne.je
> www.justenough.com
>
>
> Confidentiality Note: The information contained in this email and document(s) 
> attached are for the exclusive use of the addressee and may contain 
> confidential, privileged and non-disclosable information. If the recipient of 
> this email is not the addressee, such recipient is strictly prohibited from 
> reading, photocopying, distribution or otherwise using this email or its 
> contents in any way.
>
>


Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
I don't think R server requires R on the executor nodes. I originally set
up a SparkR cluster for our Data Scientist on Azure which required that I
install R on each node, but for the R Server set up, there is an extra edge
node with R server that they connect to. From what little research I was
able to do, it seems that there are some special functions in R Server that
can distribute the work to the cluster.

Documentation is light, and hard to find but I found this helpful:
https://blogs.msdn.microsoft.com/uk_faculty_connection/2016/05/10/r-server-for-hdinsight-running-on-microsoft-azure-cloud-data-science-challenges/



On Wed, Jun 29, 2016 at 3:29 PM, Sean Owen  wrote:

> Oh, interesting: does this really mean the return of distributing R
> code from driver to executors and running it remotely, or do I
> misunderstand? this would require having R on the executor nodes like
> it used to?
>
> On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh  wrote:
> > There is some new SparkR functionality coming in Spark 2.0, such as
> > "dapply". You could use SparkR to load a Parquet file and then run
> "dapply"
> > to apply a function to each partition of a DataFrame.
> >
> > Info about loading Parquet file:
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
> >
> > API doc for "dapply":
> >
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
> >
> > Xinh
> >
> > On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog 
> wrote:
> >>
> >> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
> >> stuff you want to do on the Rscript stdin,  p
> >>
> >>
> >> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau <
> gilad.lan...@clicktale.com>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>>
> >>>
> >>> I want to use R code as part of spark application (the same way I would
> >>> do with Scala/Python).  I want to be able to run an R syntax as a map
> >>> function on a big Spark dataframe loaded from a parquet file.
> >>>
> >>> Is this even possible or the only way to use R is as part of RStudio
> >>> orchestration of our Spark  cluster?
> >>>
> >>>
> >>>
> >>> Thanks for the help!
> >>>
> >>>
> >>>
> >>> Gilad
> >>>
> >>>
> >>
> >>
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 

John Aherne
Big Data and SQL Developer

[image: JustEnough Logo]

Cell:
Email:
Skype:
Web:

+1 (303) 809-9718
john.ahe...@justenough.com
john.aherne.je
www.justenough.com


Confidentiality Note: The information contained in this email and
document(s) attached are for the exclusive use of the addressee and
may contain confidential, privileged and non-disclosable information.
If the recipient of this email is not the addressee, such recipient is
strictly prohibited from reading, photocopying, distribution or
otherwise using this email or its contents in any way.


PySpark crashed because "remote RPC client disassociated"

2016-06-29 Thread jw.cmu
I am running my own PySpark application (solving matrix factorization using
Gemulla's DSGD algorithm). The program seemed to work fine on the smaller
MovieLens dataset but failed on the larger Netflix data. It took about 14 hours
to complete two iterations and it lost an executor (I used a total of 8
executors, all on one 16-core machine) because "remote RPC client disassociated".

Below is the full error message. I would appreciate any pointers on debugging
this problem. Thanks!

16/06/29 12:43:50 WARN TaskSetManager: Lost task 7.0 in stage 2581.0 (TID
9304, no139.nome.nx): TaskKilled (killed intentionally)
py4j.protocol.Py4JJavaError16/06/29 12:43:53 WARN TaskSetManager: Lost task
6.0 in stage 2581.0 (TID 9303, no139.nome.nx): TaskKilled (killed
intentionally)
16/06/29 12:43:53 WARN TaskSetManager: Lost task 2.0 in stage 2581.0 (TID
9299, no139.nome.nx): TaskKilled (killed intentionally)
16/06/29 12:43:53 INFO TaskSchedulerImpl: Removed TaskSet 2581.0, whose
tasks have all completed, from pool
: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3
in stage 2581.0 failed 1 times, most recent failure: Lost task 3.0 in stage
2581.0 (TID 9300, no139.nome.nx): ExecutorLostFailure (executor 5 exited
caused by one of the running tasks) Reason: Remote RPC client disassociated.
Likely due to containers exceeding thresholds, or network issues. Check
driver logs for WARN messages.
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.GeneratedMethodAccessor96.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-crashed-because-remote-RPC-client-disassociated-tp27248.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Error report file is deleted automatically after spark application finished

2016-06-29 Thread prateek arora

Hi

My Spark application was crashed and show information

LogType:stdout
Log Upload Time:Wed Jun 29 14:38:03 -0700 2016
LogLength:1096
Log Contents:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x7f67baa0d221, pid=12207, tid=140083473176320
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# C  [libcaffe.so.1.0.0-rc3+0x786221]  sgemm_kernel+0x21
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
#
/yarn/nm/usercache/ubuntu/appcache/application_1467236060045_0001/container_1467236060045_0001_01_03/hs_err_pid12207.log



but I am not able to find the
"/yarn/nm/usercache/ubuntu/appcache/application_1467236060045_0001/container_1467236060045_0001_01_03/hs_err_pid12207.log"
file. It is deleted automatically after the Spark application finishes.


How do I retain the report file? I am running Spark with YARN.

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-report-file-is-deleted-automatically-after-spark-application-finished-tp27247.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using R code as part of a Spark Application

2016-06-29 Thread Sean Owen
Oh, interesting: does this really mean the return of distributing R
code from driver to executors and running it remotely, or do I
misunderstand? this would require having R on the executor nodes like
it used to?

On Wed, Jun 29, 2016 at 5:53 PM, Xinh Huynh  wrote:
> There is some new SparkR functionality coming in Spark 2.0, such as
> "dapply". You could use SparkR to load a Parquet file and then run "dapply"
> to apply a function to each partition of a DataFrame.
>
> Info about loading Parquet file:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>
> API doc for "dapply":
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>
> Xinh
>
> On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog  wrote:
>>
>> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
>> stuff you want to do on the Rscript stdin,  p
>>
>>
>> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau 
>> wrote:
>>>
>>> Hello,
>>>
>>>
>>>
>>> I want to use R code as part of spark application (the same way I would
>>> do with Scala/Python).  I want to be able to run an R syntax as a map
>>> function on a big Spark dataframe loaded from a parquet file.
>>>
>>> Is this even possible or the only way to use R is as part of RStudio
>>> orchestration of our Spark  cluster?
>>>
>>>
>>>
>>> Thanks for the help!
>>>
>>>
>>>
>>> Gilad
>>>
>>>
>>
>>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Possible to broadcast a function?

2016-06-29 Thread Sean Owen
Ah, I completely read over the "250GB" part. Yeah you have a huge heap
then and indeed you can run into problems with GC pauses. You can
probably still manage such huge executors with a fair bit of care with
the GC and memory settings, and, you have a good reason to consider
this. In particular I imagine you will want a quite large old
generation on the heap, and focus on settings that optimize for low
pause rather than throughput.

If these nodes are entirely dedicated to one app, yes ideally you let
1 executor take all usable memory on each, if you can do so without GC
becoming an issue. Indeed fitting the driver becomes an issue because
the memory size must be the same for all executors. You could run the
drivers on another smaller machine?

Since your code will be using lots of heap unknown to Spark, then you
will want to make sure you tell Spark to limit cache / shuffle memory
more than usual, so that it doesn't run you out of memory.
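
As a hedged illustration of that last point (the values are placeholders, and
spark.memory.fraction assumes the unified memory manager introduced in Spark 1.6):

import org.apache.spark.SparkConf

// Shrink the share of the heap Spark reserves for execution (shuffle) and storage
// (cache), leaving the rest of the executor heap for the app's own data structures.
val conf = new SparkConf()
  .set("spark.memory.fraction", "0.3")        // example value; the default is much higher
  .set("spark.memory.storageFraction", "0.3") // portion of the above protected from eviction
// On the legacy (pre-1.6) memory manager the corresponding knobs are
// spark.storage.memoryFraction and spark.shuffle.memoryFraction.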

On Wed, Jun 29, 2016 at 5:40 PM, Aaron Perrin  wrote:
> From what I've read, people had seen performance issues when the JVM used
> more than 60 GiB of memory.  I haven't tested it myself, but I guess not
> true?
>
> Also, how does one optimize memory when the driver allocates some on one
> node?
>
> For example, let's say my cluster has N nodes each with 500 GiB of memory.
> And, let's say roughly, the amount of memory available per executors is
> ~80%, or ~400 GiB.  So, you're suggesting I should allocate ~400 GiB of mem
> to the executor?  How does that affect the node that's hosting the driver,
> when the driver uses ~10-15 GiB?  Or, do I have to decrease executor memory
> to ~385 across all executors?
>
> (Note: I'm running on Yarn, which may affect this.)
>
> Thanks,
>
> Aaron
>
>
> On Wed, Jun 29, 2016 at 12:09 PM Sean Owen  wrote:
>>
>> If you have one executor per machine, which is the right default thing
>> to do, and this is a singleton in the JVM, then this does just have
>> one copy per machine. Of course an executor is tied to an app, so if
>> you mean to hold this data across executors that won't help.
>>
>>
>> On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin 
>> wrote:
>> > The user guide describes a broadcast as a way to move a large dataset to
>> > each node:
>> >
>> > "Broadcast variables allow the programmer to keep a read-only variable
>> > cached on each machine rather than shipping a copy of it with tasks.
>> > They
>> > can be used, for example, to give every node a copy of a large input
>> > dataset
>> > in an efficient manner."
>> >
>> > And the broadcast example shows it being used with a variable.
>> >
>> > But, is it somehow possible to instead broadcast a function that can be
>> > executed once, per node?
>> >
>> > My use case is the following:
>> >
>> > I have a large data structure that I currently create on each executor.
>> > The
>> > way that I create it is a hack.  That is, when the RDD function is
>> > executed
>> > on the executor, I block, load a bunch of data (~250 GiB) from an
>> > external
>> > data source, create the data structure as a static object in the JVM,
>> > and
>> > then resume execution.  This works, but it ends up costing me a lot of
>> > extra
>> > memory (i.e. a few TiB when I have a lot of executors).
>> >
>> > What I'd like to do is use the broadcast mechanism to load the data
>> > structure once, per node.  But, I can't serialize the data structure
>> > from
>> > the driver.
>> >
>> > Any ideas?
>> >
>> > Thanks!
>> >
>> > Aaron
>> >

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Friendly Reminder: Spark Summit EU CfP Deadline July 1, 2016

2016-06-29 Thread Jules Damji
Hello All,

If you haven't submitted a CfP for Spark Summit EU, the deadline is this 
Friday, July 1st. 

Submit at https://spark-summit.org/eu-2016/

Cheers!

Jules
Spark Community Evangelist
Databricks, Inc.

Sent from my iPhone
Pardon the dumb thumb typos :)

Kudu Connector

2016-06-29 Thread Benjamin Kim
I was wondering if anyone who is a Spark Scala developer would be willing to
continue the work done on the Kudu connector?

https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu

I have been testing and using Kudu for the past month and comparing against 
HBase. It seems like a promising data store to complement Spark. It fills the 
gap in our company as a fast updatable data store. We stream GB’s of data in 
and run analytical queries against it, which run in well below a minute 
typically. According to the Kudu users group, all it needs is to add SQL (JDBC) 
friendly features (CREATE TABLE, intuitive save modes (append = upsert and 
overwrite = truncate + insert), DELETE, etc.) and improve performance by 
implementing locality.

For reference, here is the page on contributing.

http://kudu.apache.org/docs/contributing.html

I am hoping that for individuals in the Spark community it would be relatively 
easy.

Thanks!
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Set the node the spark driver will be started

2016-06-29 Thread Bryan Cutler
Hi Felix,

I think the problem you are describing has been fixed in later versions,
check out this JIRA https://issues.apache.org/jira/browse/SPARK-13803


On Wed, Jun 29, 2016 at 9:27 AM, Mich Talebzadeh 
wrote:

> Fine. In standalone mode Spark uses its own scheduling, as opposed to YARN
> or anything else.
>
> As a matter of interest, can you start spark-submit from any node in the
> cluster? Do these all have the same or similar CPU and RAM?
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 June 2016 at 10:54, Felix Massem 
> wrote:
>
>> In addition, we are not using YARN; we are using standalone mode, and
>> the driver is started with deploy-mode cluster.
>>
>> Thx Felix
>> Felix Massem | IT-Consultant | Karlsruhe
>> mobil: +49 (0) 172.2919848
>>
>> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
>> www.more4fi.de
>>
>> Registered office: Düsseldorf | HRB 63043 | District Court of Düsseldorf
>> Executive board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
>> Supervisory board: Patric Fedlmeier (Chairman) . Klaus Jäger . Jürgen
>> Schütz
>>
>> This e-mail, including any attached files, contains confidential and/or
>> legally protected information. If you are not the intended recipient or
>> have received this e-mail in error, please inform the sender immediately
>> and delete this e-mail and any attached files. Unauthorised copying, use
>> or opening of any attached files, as well as unauthorised forwarding of
>> this e-mail, is not permitted.
>>
>> Am 29.06.2016 um 11:13 schrieb Felix Massem > >:
>>
>> Hey Mich,
>>
>> the distribution is effectively not balanced. Right now I have 15 applications
>> and all 15 drivers are running on one node. This is just after giving all
>> machines a little more memory.
>> Before that I had 15 applications and about 13 drivers were running on
>> one machine. While trying to submit a new job I got OOM exceptions which
>> took down my Cassandra service, only for the new driver to start on the same
>> node where all the other 13 drivers were running.
>>
>> Thx and best regards
>> Felix
>>
>>
>> Felix Massem | IT-Consultant | Karlsruhe
>> mobil: +49 (0) 172.2919848
>>
>> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
>> www.more4fi.de
>>
>> Registered office: Düsseldorf | HRB 63043 | District Court of Düsseldorf
>> Executive board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
>> Supervisory board: Patric Fedlmeier (Chairman) . Klaus Jäger . Jürgen
>> Schütz
>>
>> This e-mail, including any attached files, contains confidential and/or
>> legally protected information. If you are not the intended recipient or
>> have received this e-mail in error, please inform the sender immediately
>> and delete this e-mail and any attached files. Unauthorised copying, use
>> or opening of any attached files, as well as unauthorised forwarding of
>> this e-mail, is not permitted.
>>
>> Am 28.06.2016 um 17:55 schrieb Mich Talebzadeh > >:
>>
>> Hi Felix,
>>
>> In Yarn-cluster mode the resource manager Yarn is expected to take care
>> of that.
>>
>> Are you getting some skewed distribution with drivers created through
>> spark-submit on different nodes?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 28 June 2016 at 16:06, Felix Massem 
>> wrote:
>>
>>> Hey Mich,
>>>
>>> thx for the fast reply.
>>>
>>> We are using it in cluster mode and spark version 1.5.2
>>>
>>> Greets Felix
>>>
>>>
>>> Felix Massem | IT-Consultant | Karlsruhe
>>> mobil: +49 (0) 172.2919848
>>>
>>> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
>>> www.more4fi.de
>>>

Re: Unsubscribe - 3rd time

2016-06-29 Thread Mich Talebzadeh
Indeed, Nicholas, a very valid point.

An example of the newer email listings, like the one below for ISUG etc., that allow
all options including unsubscribe:


-End Original Message-

*Site Links:* View post online
   View mailing
list online    Start new thread via
email    Unsubscribe from this mailing list
   Manage your
subscription 


Use of this email content is governed by the terms of service at:
http://my.isug.com/p/cm/ld/fid=464
--

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 June 2016 at 18:33, Nicholas Chammas 
wrote:

> > I'm not sure I've ever come across an email list that allows you to
> unsubscribe by responding to the list with "unsubscribe".
>
> Many noreply lists (e.g. companies sending marketing email) actually work
> that way, which is probably what most people are used to these days.
>
> What this list needs is an unsubscribe link in the footer, like most
> modern mailing lists have. Work to add this in is already in progress here:
> https://issues.apache.org/jira/browse/INFRA-12185
>
> Nick
>
> On Wed, Jun 29, 2016 at 12:57 PM Jonathan Kelly 
> wrote:
>
>> If at first you don't succeed, try, try again. But please don't. :)
>>
>> See the "unsubscribe" link here: http://spark.apache.org/community.html
>>
>> I'm not sure I've ever come across an email list that allows you to
>> unsubscribe by responding to the list with "unsubscribe". At least, all of
>> the Apache ones have a separate address to which you send
>> subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
>> messages to the actual list almost every day.
>>
>> On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> LOL. Bravely said Joaquin.
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 June 2016 at 16:54, Joaquin Alzola 
>>> wrote:
>>>
 And 3rd time is not enough to know that unsubscribe is done through
 user-unsubscr...@spark.apache.org



 *From:* Steve Florence [mailto:sflore...@ypm.com]
 *Sent:* 29 June 2016 16:47
 *To:* user@spark.apache.org
 *Subject:* Unsubscribe - 3rd time




 This email is confidential and may be subject to privilege. If you are
 not the intended recipient, please do not copy or disclose its content but
 contact the sender immediately upon receipt.

>>>
>>>


Apache Spark Is Hanging when fetch data from SQL Server 2008

2016-06-29 Thread Gastón Schabas
Hi everyone. I'm experiencing an issue when I try to fetch data from SQL
Server. This is my context:
Ubuntu 14.04 LTS
Apache Spark 1.4.0
SQL Server 2008
Scala 2.10.5
Sbt 0.13.11

I'm trying to fetch data from a table in SQL Server 2008 that has
85.000.000 records. I only need around 200.000 of them. This is my code:
val df = sqlContext.read.jdbc("anUrl", "aTableName", Array(s"timestamp >=
'2016-06-21T00:00:00'", s"timestamp < '2016-06-22T00:00:00'"), new
Properties)

if I do this
df.take(5).foreach(println)
it works without any trouble.

if I do this
println(df.count()) // this should return 200.000
the application hangs

I've opened http://localhost:4040/ to check what Spark is doing. When I
open the job details, it shows that it is running the count method, and
this is the detail:
org.apache.spark.sql.DataFrame.count(DataFrame.scala:1269)
SkipOverPlaysInWeek$.main(SkipOverPlaysInWeek.scala:88)
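
One hedged note, since the predicates overload of read.jdbc turns each element of the
array into the WHERE clause of its own partition: the two conditions above are not
combined with AND, so between them the two partitions cover (and partly duplicate)
almost the whole 85.000.000-row table, which take(5) never notices but count() must
scan. A sketch of passing the range as one combined predicate instead (an
illustration, not a confirmed diagnosis of the hang):

val df2 = sqlContext.read.jdbc(
  "anUrl",       // placeholder URL from the code above
  "aTableName",  // placeholder table name
  Array("timestamp >= '2016-06-21T00:00:00' AND timestamp < '2016-06-22T00:00:00'"),
  new java.util.Properties)

println(df2.count())  // a single partition whose WHERE clause matches only the wanted day
// For parallel reads, the jdbc(url, table, columnName, lowerBound, upperBound,
// numPartitions, props) overload partitions on a numeric column instead.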

Thanks,

Gastón


groupBy cannot handle large RDDs

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this? It looks like a bug:

http://stackoverflow.com/questions/38106554/groupby-cannot-handle-large-rdds

Best regards,

Kaiyin ZHONG


Re: Unsubscribe - 3rd time

2016-06-29 Thread Nicholas Chammas
> I'm not sure I've ever come across an email list that allows you to
unsubscribe by responding to the list with "unsubscribe".

Many noreply lists (e.g. companies sending marketing email) actually work
that way, which is probably what most people are used to these days.

What this list needs is an unsubscribe link in the footer, like most modern
mailing lists have. Work to add this in is already in progress here:
https://issues.apache.org/jira/browse/INFRA-12185

Nick

On Wed, Jun 29, 2016 at 12:57 PM Jonathan Kelly 
wrote:

> If at first you don't succeed, try, try again. But please don't. :)
>
> See the "unsubscribe" link here: http://spark.apache.org/community.html
>
> I'm not sure I've ever come across an email list that allows you to
> unsubscribe by responding to the list with "unsubscribe". At least, all of
> the Apache ones have a separate address to which you send
> subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
> messages to the actual list almost every day.
>
> On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh 
> wrote:
>
>> LOL. Bravely said Joaquin.
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 June 2016 at 16:54, Joaquin Alzola 
>> wrote:
>>
>>> And 3rd time is not enough to know that unsubscribe is done through
>>> user-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>> *From:* Steve Florence [mailto:sflore...@ypm.com]
>>> *Sent:* 29 June 2016 16:47
>>> *To:* user@spark.apache.org
>>> *Subject:* Unsubscribe - 3rd time
>>>
>>>
>>>
>>>
>>> This email is confidential and may be subject to privilege. If you are
>>> not the intended recipient, please do not copy or disclose its content but
>>> contact the sender immediately upon receipt.
>>>
>>
>>


Re: Using R code as part of a Spark Application

2016-06-29 Thread Jörn Franke
You still need SparkR.
> On 29 Jun 2016, at 19:14, John Aherne  wrote:
> 
> Microsoft Azure has an option to create a spark cluster with R Server. MS 
> bought RevoScale (I think that was the name) and just recently deployed it. 
> 
>> On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh  wrote:
>> There is some new SparkR functionality coming in Spark 2.0, such as 
>> "dapply". You could use SparkR to load a Parquet file and then run "dapply" 
>> to apply a function to each partition of a DataFrame.
>> 
>> Info about loading Parquet file:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>> 
>> API doc for "dapply":
>> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>> 
>> Xinh
>> 
>>> On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog  wrote:
>>> try Spark pipeRDD's , you can invoke the R script from pipe , push  the 
>>> stuff you want to do on the Rscript stdin,  p 
>>> 
>>> 
 On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau  
 wrote:
 Hello,
 
  
 
 I want to use R code as part of spark application (the same way I would do 
 with Scala/Python).  I want to be able to run an R syntax as a map 
 function on a big Spark dataframe loaded from a parquet file.
 
 Is this even possible or the only way to use R is as part of RStudio 
 orchestration of our Spark  cluster?
 
  
 
 Thanks for the help!
 
  
 
 Gilad
 
> 
> 
> 
> -- 
> John Aherne
> Big Data and SQL Developer
> 
> 
> 
> Cell:
> Email:
> Skype:
> Web:
> 
> +1 (303) 809-9718
> john.ahe...@justenough.com
> john.aherne.je
> www.justenough.com
> 
> 
> Confidentiality Note: The information contained in this email and document(s) 
> attached are for the exclusive use of the addressee and may contain 
> confidential, privileged and non-disclosable information. If the recipient of 
> this email is not the addressee, such recipient is strictly prohibited from 
> reading, photocopying, distribution or otherwise using this email or its 
> contents in any way.


Re: Using R code as part of a Spark Application

2016-06-29 Thread John Aherne
Microsoft Azure has an option to create a spark cluster with R Server. MS
bought RevoScale (I think that was the name) and just recently deployed it.

On Wed, Jun 29, 2016 at 10:53 AM, Xinh Huynh  wrote:

> There is some new SparkR functionality coming in Spark 2.0, such as
> "dapply". You could use SparkR to load a Parquet file and then run "dapply"
> to apply a function to each partition of a DataFrame.
>
> Info about loading Parquet file:
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources
>
> API doc for "dapply":
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html
>
> Xinh
>
> On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog  wrote:
>
>> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
>> stuff you want to do on the Rscript stdin,  p
>>
>>
>> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau > > wrote:
>>
>>> Hello,
>>>
>>>
>>>
>>> I want to use R code as part of spark application (the same way I would
>>> do with Scala/Python).  I want to be able to run an R syntax as a map
>>> function on a big Spark dataframe loaded from a parquet file.
>>>
>>> Is this even possible or the only way to use R is as part of RStudio
>>> orchestration of our Spark  cluster?
>>>
>>>
>>>
>>> Thanks for the help!
>>>
>>>
>>>
>>> Gilad
>>>
>>>
>>>
>>
>>
>


-- 

John Aherne
Big Data and SQL Developer

[image: JustEnough Logo]

Cell:
Email:
Skype:
Web:

+1 (303) 809-9718
john.ahe...@justenough.com
john.aherne.je
www.justenough.com


Confidentiality Note: The information contained in this email and
document(s) attached are for the exclusive use of the addressee and
may contain confidential, privileged and non-disclosable information.
If the recipient of this email is not the addressee, such recipient is
strictly prohibited from reading, photocopying, distribution or
otherwise using this email or its contents in any way.


Re: Unsubscribe - 3rd time

2016-06-29 Thread Jonathan Kelly
If at first you don't succeed, try, try again. But please don't. :)

See the "unsubscribe" link here: http://spark.apache.org/community.html

I'm not sure I've ever come across an email list that allows you to
unsubscribe by responding to the list with "unsubscribe". At least, all of
the Apache ones have a separate address to which you send
subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
messages to the actual list almost every day.

On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh 
wrote:

> LOL. Bravely said Joaquin.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 June 2016 at 16:54, Joaquin Alzola 
> wrote:
>
>> And 3rd time is not enough to know that unsubscribe is done through -->
>> user-unsubscr...@spark.apache.org
>>
>>
>>
>> *From:* Steve Florence [mailto:sflore...@ypm.com]
>> *Sent:* 29 June 2016 16:47
>> *To:* user@spark.apache.org
>> *Subject:* Unsubscribe - 3rd time
>>
>>
>>
>>
>> This email is confidential and may be subject to privilege. If you are
>> not the intended recipient, please do not copy or disclose its content but
>> contact the sender immediately upon receipt.
>>
>
>


Re: Using R code as part of a Spark Application

2016-06-29 Thread Xinh Huynh
There is some new SparkR functionality coming in Spark 2.0, such as
"dapply". You could use SparkR to load a Parquet file and then run "dapply"
to apply a function to each partition of a DataFrame.

Info about loading Parquet file:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/sparkr.html#from-data-sources

API doc for "dapply":
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc1-docs/api/R/index.html

Xinh

On Wed, Jun 29, 2016 at 6:54 AM, sujeet jog  wrote:

> try Spark pipeRDD's , you can invoke the R script from pipe , push  the
> stuff you want to do on the Rscript stdin,  p
>
>
> On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau 
> wrote:
>
>> Hello,
>>
>>
>>
>> I want to use R code as part of spark application (the same way I would
>> do with Scala/Python).  I want to be able to run an R syntax as a map
>> function on a big Spark dataframe loaded from a parquet file.
>>
>> Is this even possible or the only way to use R is as part of RStudio
>> orchestration of our Spark  cluster?
>>
>>
>>
>> Thanks for the help!
>>
>>
>>
>> Gilad
>>
>>
>>
>
>


Re: Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
From what I've read, people have seen performance issues when the JVM used
more than 60 GiB of memory. I haven't tested it myself, but I guess that's
not true?

Also, how does one optimize memory when the driver allocates some on one
node?

For example, let's say my cluster has N nodes, each with 500 GiB of memory.
And let's say, roughly, the amount of memory available per executor is ~80%,
or ~400 GiB. So, you're suggesting I should allocate ~400 GiB of memory to
the executor? How does that affect the node that's hosting the driver, when
the driver uses ~10-15 GiB? Or do I have to decrease executor memory to
~385 GiB across all executors?

(Note: I'm running on Yarn, which may affect this.)

Thanks,

Aaron


On Wed, Jun 29, 2016 at 12:09 PM Sean Owen  wrote:

> If you have one executor per machine, which is the right default thing
> to do, and this is a singleton in the JVM, then this does just have
> one copy per machine. Of course an executor is tied to an app, so if
> you mean to hold this data across executors that won't help.
>
>
> On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin 
> wrote:
> > The user guide describes a broadcast as a way to move a large dataset to
> > each node:
> >
> > "Broadcast variables allow the programmer to keep a read-only variable
> > cached on each machine rather than shipping a copy of it with tasks. They
> > can be used, for example, to give every node a copy of a large input
> dataset
> > in an efficient manner."
> >
> > And the broadcast example shows it being used with a variable.
> >
> > But, is it somehow possible to instead broadcast a function that can be
> > executed once, per node?
> >
> > My use case is the following:
> >
> > I have a large data structure that I currently create on each executor.
> The
> > way that I create it is a hack.  That is, when the RDD function is
> executed
> > on the executor, I block, load a bunch of data (~250 GiB) from an
> external
> > data source, create the data structure as a static object in the JVM, and
> > then resume execution.  This works, but it ends up costing me a lot of
> extra
> > memory (i.e. a few TiB when I have a lot of executors).
> >
> > What I'd like to do is use the broadcast mechanism to load the data
> > structure once, per node.  But, I can't serialize the data structure from
> > the driver.
> >
> > Any ideas?
> >
> > Thanks!
> >
> > Aaron
> >
>


Re: Set the node the spark driver will be started

2016-06-29 Thread Mich Talebzadeh
Fine. In standalone mode Spark uses its own scheduling, as opposed to Yarn
or anything else.

As a matter of interest, can you start spark-submit from any node in the
cluster? Do these nodes all have the same or similar CPU and RAM?


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 June 2016 at 10:54, Felix Massem  wrote:

> In addition we are not using Yarn we are using the standalone mode and the
> driver will be started with the deploy-mode cluster
>
> Thx Felix
> Felix Massem | IT-Consultant | Karlsruhe
> mobil: +49 (0) 172.2919848
>
> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
> www.more4fi.de
>
> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
>
> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche
> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige
> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie
> bitte sofort den Absender und löschen Sie diese E-Mail und evtl.
> beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen
> evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist
> nicht gestattet.
>
> Am 29.06.2016 um 11:13 schrieb Felix Massem :
>
> Hey Mich,
>
> the distribution is like not given. Just right now I have 15 applications
> and all 15 drivers are running on one node. This is just after giving all
> machines a little more memory.
> Before I had like 15 applications and about 13 driver where running on one
> machine. While trying to submit a new job I got OOM exceptions which took
> down my cassandra service only to start the driver on the same node where
>  all the other 13 drivers where running.
>
> Thx and best regards
> Felix
>
>
> Felix Massem | IT-Consultant | Karlsruhe
> mobil: +49 (0) 172.2919848
>
> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
> www.more4fi.de
>
> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
>
> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche
> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige
> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie
> bitte sofort den Absender und löschen Sie diese E-Mail und evtl.
> beigefügter Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen
> evtl. beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist
> nicht gestattet.
>
> Am 28.06.2016 um 17:55 schrieb Mich Talebzadeh  >:
>
> Hi Felix,
>
> In Yarn-cluster mode the resource manager Yarn is expected to take care of
> that.
>
> Are you getting some skewed distribution with drivers created through
> spark-submit on different nodes?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 June 2016 at 16:06, Felix Massem 
> wrote:
>
>> Hey Mich,
>>
>> thx for the fast reply.
>>
>> We are using it in cluster mode and spark version 1.5.2
>>
>> Greets Felix
>>
>>
>> Felix Massem | IT-Consultant | Karlsruhe
>> mobil: +49 (0) 172.2919848
>>
>> www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
>> www.more4fi.de
>>
>> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
>> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
>> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen
>> Schütz
>>
>> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält
>> vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht
>> der richtige Adressat sind oder diese 

Re: Possible to broadcast a function?

2016-06-29 Thread Sean Owen
If you have one executor per machine, which is the right default thing
to do, and this is a singleton in the JVM, then this does just have
one copy per machine. Of course an executor is tied to an app, so if
you mean to hold this data across executors that won't help.
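
To make the singleton idea concrete, here is a minimal sketch (illustrative
names, not Aaron's actual code): a lazy val inside a Scala object is
initialized at most once per executor JVM, the first time any task on that
executor touches it. The small Map stands in for the large externally loaded
structure.

import org.apache.spark.rdd.RDD

object PerExecutorCache {
  // Initialized lazily, once per executor JVM; in the real use case this is
  // where the large load from the external source would happen.
  lazy val bigStructure: Map[String, Int] = Map("a" -> 1, "b" -> 2)
}

def lookupKeys(rdd: RDD[String]): RDD[(String, Option[Int])] =
  rdd.mapPartitions { iter =>
    val data = PerExecutorCache.bigStructure // reuses the single per-JVM copy
    iter.map(key => (key, data.get(key)))
  }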


On Wed, Jun 29, 2016 at 3:00 PM, Aaron Perrin  wrote:
> The user guide describes a broadcast as a way to move a large dataset to
> each node:
>
> "Broadcast variables allow the programmer to keep a read-only variable
> cached on each machine rather than shipping a copy of it with tasks. They
> can be used, for example, to give every node a copy of a large input dataset
> in an efficient manner."
>
> And the broadcast example shows it being used with a variable.
>
> But, is it somehow possible to instead broadcast a function that can be
> executed once, per node?
>
> My use case is the following:
>
> I have a large data structure that I currently create on each executor.  The
> way that I create it is a hack.  That is, when the RDD function is executed
> on the executor, I block, load a bunch of data (~250 GiB) from an external
> data source, create the data structure as a static object in the JVM, and
> then resume execution.  This works, but it ends up costing me a lot of extra
> memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data
> structure once, per node.  But, I can't serialize the data structure from
> the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Unsubscribe - 3rd time

2016-06-29 Thread Mich Talebzadeh
LOL. Bravely said Joaquin.

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 June 2016 at 16:54, Joaquin Alzola  wrote:

> And 3rd time is not enough to know that unsubscribe is done through -->
> user-unsubscr...@spark.apache.org
>
>
>
> *From:* Steve Florence [mailto:sflore...@ypm.com]
> *Sent:* 29 June 2016 16:47
> *To:* user@spark.apache.org
> *Subject:* Unsubscribe - 3rd time
>
>
>
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


RE: Unsubscribe - 3rd time

2016-06-29 Thread Joaquin Alzola
And 3rd time is not enough to know that unsubscribe is done through --> 
user-unsubscr...@spark.apache.org

From: Steve Florence [mailto:sflore...@ypm.com]
Sent: 29 June 2016 16:47
To: user@spark.apache.org
Subject: Unsubscribe - 3rd time


This email is confidential and may be subject to privilege. If you are not the 
intended recipient, please do not copy or disclose its content but contact the 
sender immediately upon receipt.


Unsubscribe - 3rd time

2016-06-29 Thread Steve Florence



Re: Possible to broadcast a function?

2016-06-29 Thread Sonal Goyal
Have you looked at Alluxio? (earlier tachyon)

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Wed, Jun 29, 2016 at 7:30 PM, Aaron Perrin  wrote:

> The user guide describes a broadcast as a way to move a large dataset to
> each node:
>
> "Broadcast variables allow the programmer to keep a read-only variable
> cached on each machine rather than shipping a copy of it with tasks. They
> can be used, for example, to give every node a copy of a large input
> dataset in an efficient manner."
>
> And the broadcast example shows it being used with a variable.
>
> But, is it somehow possible to instead broadcast a function that can be
> executed once, per node?
>
> My use case is the following:
>
> I have a large data structure that I currently create on each executor.
> The way that I create it is a hack.  That is, when the RDD function is
> executed on the executor, I block, load a bunch of data (~250 GiB) from an
> external data source, create the data structure as a static object in the
> JVM, and then resume execution.  This works, but it ends up costing me a
> lot of extra memory (i.e. a few TiB when I have a lot of executors).
>
> What I'd like to do is use the broadcast mechanism to load the data
> structure once, per node.  But, I can't serialize the data structure from
> the driver.
>
> Any ideas?
>
> Thanks!
>
> Aaron
>
>


Re: Do tasks from the same application run in different JVMs

2016-06-29 Thread Mathieu Longtin
Same JVMs.

On Wed, Jun 29, 2016 at 8:48 AM Huang Meilong  wrote:

> Hi,
>
> In spark, tasks from different applications run in different JVMs, then
> what about tasks from the same application?
>
-- 
Mathieu Longtin
1-514-803-8977


Possible to broadcast a function?

2016-06-29 Thread Aaron Perrin
The user guide describes a broadcast as a way to move a large dataset to
each node:

"Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks. They
can be used, for example, to give every node a copy of a large input
dataset in an efficient manner."

And the broadcast example shows it being used with a variable.

But, is it somehow possible to instead broadcast a function that can be
executed once, per node?

My use case is the following:

I have a large data structure that I currently create on each executor.
The way that I create it is a hack.  That is, when the RDD function is
executed on the executor, I block, load a bunch of data (~250 GiB) from an
external data source, create the data structure as a static object in the
JVM, and then resume execution.  This works, but it ends up costing me a
lot of extra memory (i.e. a few TiB when I have a lot of executors).

What I'd like to do is use the broadcast mechanism to load the data
structure once, per node.  But, I can't serialize the data structure from
the driver.

Any ideas?

Thanks!

Aaron


Re: Using R code as part of a Spark Application

2016-06-29 Thread sujeet jog
Try Spark's RDD pipe(): you can invoke the R script through pipe and push
the data you want to process to the Rscript's stdin.
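
A minimal sketch of what that can look like (the script name, record format
and SparkContext handle are assumptions, not from the original question):

import org.apache.spark.SparkContext

def scoreWithR(sc: SparkContext): Unit = {
  // Each element of a partition is written as one line to the stdin of an
  // external Rscript process; whatever the script prints to stdout comes
  // back as an RDD[String].
  val input = sc.parallelize(Seq("1,2,3", "4,5,6"))
  val scored = input.pipe("Rscript score.R")
  scored.collect().foreach(println)
}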


On Wed, Jun 29, 2016 at 7:10 PM, Gilad Landau 
wrote:

> Hello,
>
>
>
> I want to use R code as part of spark application (the same way I would do
> with Scala/Python).  I want to be able to run an R syntax as a map function
> on a big Spark dataframe loaded from a parquet file.
>
> Is this even possible or the only way to use R is as part of RStudio
> orchestration of our Spark  cluster?
>
>
>
> Thanks for the help!
>
>
>
> Gilad
>
>
>


Re: Spark jobs

2016-06-29 Thread sujeet jog
check if this helps,

import os
from multiprocessing import Process

def training():
    print("Training Workflow")
    # Run spark-submit as an external process; os.system blocks until it exits
    cmd = "spark/bin/spark-submit ./ml.py"
    os.system(cmd)

w_training = Process(target=training)
w_training.start()



On Wed, Jun 29, 2016 at 6:28 PM, Joaquin Alzola 
wrote:

> Hi,
>
>
>
> This is a totally newbie question but I seem not to find the link ….. when
> I create a spark-submit python script to be launch …
>
>
>
> how should I call it from the main python script with a subprocess.popen?
>
>
>
> BR
>
>
>
> Joaquin
>
>
>
>
>
>
>
>
>
>
>
>
> This email is confidential and may be subject to privilege. If you are not
> the intended recipient, please do not copy or disclose its content but
> contact the sender immediately upon receipt.
>


Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, 

I’m not sure I understand your initial question… 

Depending on the compression algorithm, you may or may not be able to split the
file. If it's not splittable, you end up with a single long-running thread.

My guess is that you end up with one very large partition.
If so, repartitioning may give you better performance in the join.
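
Something along these lines, reusing the names from the snippet quoted below;
the partition count is an arbitrary illustration and HiveContext is assumed to
be in scope as in that snippet:

import org.apache.spark.sql.functions.desc

val s2 = HiveContext.table("sales2").select("PROD_ID").repartition(200)
val s  = HiveContext.table("sales_staging").select("PROD_ID").repartition(200)
s2.join(s, "prod_id").sort(desc("prod_id")).take(5).foreach(println)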

I see that you’re using a hive context. 

Have you tried to manually do this using just data frames and compare the DAG 
to the SQL DAG? 

HTH

-Mike

> On Jun 29, 2016, at 9:14 AM, Mich Talebzadeh  
> wrote:
> 
> Hi all,
> 
> It finished in 2 hours 18 minutes!
> 
> Started at
> [29/06/2016 10:25:27.27]
> [148]
> [148]
> [148]
> [148]
> [148]
> Finished at
> [29/06/2016 12:43:33.33]
> 
> I need to dig in more. 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 June 2016 at 10:42, Mich Talebzadeh  > wrote:
> Focusing on Spark job, as I mentioned before Spark is running in local mode 
> with 8GB of memory for both the driver and executor memory.
> 
> However, I still see this enormous Duration time which indicates something is 
> wrong badly!
> 
> Also I got rid of groupBy
> 
>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>   val rs = s2.join(s,"prod_id").sort(desc("prod_id")).take(5).foreach(println)
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 29 June 2016 at 10:18, Jörn Franke  > wrote:
> 
> I think the TEZ engine is much more maintained with respect to optimizations 
> related to Orc , hive , vectorizing, querying than the mr engine. It will be 
> definitely better to use it.
> Mr is also deprecated in hive 2.0.
> For me it does not make sense to use mr with hive larger than 1.1.
> 
> As I said, order by might be inefficient to use (not sure if this has 
> changed). You may want to use sort by.
> 
> That being said there are many optimizations methods.
> 
> On 29 Jun 2016, at 00:27, Mich Talebzadeh  > wrote:
> 
>> That is a good point.
>> 
>> The ORC table property is as follows
>> 
>> TBLPROPERTIES ( "orc.compress"="SNAPPY",
>> "orc.stripe.size"="268435456",
>> "orc.row.index.stride"="1")
>> 
>> which puts each stripe at 256MB
>> 
>> Just to clarify this is spark running on Hive tables. I don't think the use 
>> of TEZ, MR or Spark as execution engines is going to make any difference?
>> 
>> This is the same query with Hive on MR
>> 
>> select a.prod_id from sales2 a, sales_staging b where a.prod_id = b.prod_id 
>> order by a.prod_id;
>> 
>> 2016-06-28 23:23:51,203 Stage-1 map = 0%,  reduce = 0%
>> 2016-06-28 23:23:59,480 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.32 
>> sec
>> 2016-06-28 23:24:08,771 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 
>> 18.21 sec
>> 2016-06-28 23:24:11,860 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 
>> 22.34 sec
>> 2016-06-28 23:24:18,021 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 
>> 30.33 sec
>> 2016-06-28 23:24:21,101 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 
>> 33.45 sec
>> 2016-06-28 23:24:24,181 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 37.5 
>> sec
>> 2016-06-28 23:24:27,270 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 42.0 
>> sec
>> 2016-06-28 23:24:30,349 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
>> 45.62 sec
>> 2016-06-28 23:24:33,441 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
>> 49.69 sec
>> 2016-06-28 23:24:36,521 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 
>> 52.92 sec
>> 2016-06-28 23:24:39,605 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
>> 56.78 sec
>> 2016-06-28 

Using R code as part of a Spark Application

2016-06-29 Thread Gilad Landau
Hello,

I want to use R code as part of a Spark application (the same way I would do
with Scala/Python). I want to be able to run R syntax as a map function on a
big Spark DataFrame loaded from a Parquet file.
Is this even possible, or is the only way to use R as part of an RStudio
orchestration of our Spark cluster?

Thanks for the help!

Gilad



Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke
Does the same happen if all the tables are in ORC format? It might be just 
simpler to convert the text table to ORC since it is rather small
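
For example, something like this from the Spark side, assuming the HiveContext
handle and table names used elsewhere in this thread (the new table name is
made up):

HiveContext.sql(
  """CREATE TABLE sales_staging_orc
    |STORED AS ORC
    |TBLPROPERTIES ("orc.compress" = "SNAPPY")
    |AS SELECT * FROM sales_staging""".stripMargin)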

> On 29 Jun 2016, at 15:14, Mich Talebzadeh  wrote:
> 
> Hi all,
> 
> It finished in 2 hours 18 minutes!
> 
> Started at
> [29/06/2016 10:25:27.27]
> [148]
> [148]
> [148]
> [148]
> [148]
> Finished at
> [29/06/2016 12:43:33.33]
> 
> I need to dig in more. 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 29 June 2016 at 10:42, Mich Talebzadeh  wrote:
>> Focusing on Spark job, as I mentioned before Spark is running in local mode 
>> with 8GB of memory for both the driver and executor memory.
>> 
>> However, I still see this enormous Duration time which indicates something 
>> is wrong badly!
>> 
>> Also I got rid of groupBy
>> 
>>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>>   val rs = 
>> s2.join(s,"prod_id").sort(desc("prod_id")).take(5).foreach(println)
>> 
>> 
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 29 June 2016 at 10:18, Jörn Franke  wrote:
>>> 
>>> I think the TEZ engine is much more maintained with respect to 
>>> optimizations related to Orc , hive , vectorizing, querying than the mr 
>>> engine. It will be definitely better to use it.
>>> Mr is also deprecated in hive 2.0.
>>> For me it does not make sense to use mr with hive larger than 1.1.
>>> 
>>> As I said, order by might be inefficient to use (not sure if this has 
>>> changed). You may want to use sort by.
>>> 
>>> That being said there are many optimizations methods.
>>> 
 On 29 Jun 2016, at 00:27, Mich Talebzadeh  
 wrote:
 
 That is a good point.
 
 The ORC table property is as follows
 
 TBLPROPERTIES ( "orc.compress"="SNAPPY",
 "orc.stripe.size"="268435456",
 "orc.row.index.stride"="1")
 
 which puts each stripe at 256MB
 
 Just to clarify this is spark running on Hive tables. I don't think the 
 use of TEZ, MR or Spark as execution engines is going to make any 
 difference?
 
 This is the same query with Hive on MR
 
 select a.prod_id from sales2 a, sales_staging b where a.prod_id = 
 b.prod_id order by a.prod_id;
 
 2016-06-28 23:23:51,203 Stage-1 map = 0%,  reduce = 0%
 2016-06-28 23:23:59,480 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 
 7.32 sec
 2016-06-28 23:24:08,771 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 
 18.21 sec
 2016-06-28 23:24:11,860 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 
 22.34 sec
 2016-06-28 23:24:18,021 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 
 30.33 sec
 2016-06-28 23:24:21,101 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 
 33.45 sec
 2016-06-28 23:24:24,181 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 
 37.5 sec
 2016-06-28 23:24:27,270 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 
 42.0 sec
 2016-06-28 23:24:30,349 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 
 45.62 sec
 2016-06-28 23:24:33,441 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 
 49.69 sec
 2016-06-28 23:24:36,521 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 
 52.92 sec
 2016-06-28 23:24:39,605 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 
 56.78 sec
 2016-06-28 23:24:42,686 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 
 60.36 sec
 2016-06-28 23:24:45,767 Stage-1 map = 81%,  reduce = 0%, Cumulative CPU 
 63.68 sec
 2016-06-28 23:24:48,842 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 
 66.92 sec
 2016-06-28 23:24:51,918 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
 70.18 sec
 2016-06-28 23:25:52,354 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
 127.99 sec
 2016-06-28 23:25:57,494 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 
 134.64 sec
 2016-06-28 

Can Spark Dataframes preserve order when joining?

2016-06-29 Thread Jestin Ma
If it’s not too much trouble, could I get some pointers/help on this? (see link)
http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order
 


Also, as a side question, do DataFrames support easy reordering of columns?

Thank you!
Jestin

Spark RDD aggregate action behaves strangely

2016-06-29 Thread Kaiyin Zhong
Could anyone have a look at this?
http://stackoverflow.com/questions/38100918/spark-rdd-aggregate-action-behaves-strangely

Thanks!

Best regards,

Kaiyin ZHONG


Spark jobs

2016-06-29 Thread Joaquin Alzola
Hi,

This is a totally newbie question, but I can't seem to find the link ... When I
create a spark-submit Python script to be launched ...

how should I call it from the main Python script with subprocess.Popen?

BR

Joaquin






This email is confidential and may be subject to privilege. If you are not the 
intended recipient, please do not copy or disclose its content but contact the 
sender immediately upon receipt.


Do tasks from the same application run in different JVMs

2016-06-29 Thread Huang Meilong
Hi,

In Spark, tasks from different applications run in different JVMs; what about
tasks from the same application?


Metadata for the StructField

2016-06-29 Thread Ted Yu
You can specify Metadata for the StructField :

case class StructField(
name: String,
dataType: DataType,
nullable: Boolean = true,
metadata: Metadata = Metadata.empty) {

FYI
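
A small sketch of building such metadata and reading it back (the field name
and key are only illustrative):

import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField, StructType}

val meta: Metadata =
  new MetadataBuilder().putString("comment", "customer identifier").build()
val schema =
  StructType(Seq(StructField("cust_id", StringType, nullable = true, metadata = meta)))

// The metadata travels with the schema, so a DataFrame created with this
// schema exposes it via: df.schema("cust_id").metadata.getString("comment")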

On Wed, Jun 29, 2016 at 2:50 AM, pooja mehta  wrote:

> Hi,
>
> Want to add a metadata field to StructField case class in spark.
>
> case class StructField(name: String)
>
> And how to carry over the metadata in query execution.
>
>
>
>


Re: Set the node the spark driver will be started

2016-06-29 Thread Felix Massem
In addition, we are not using Yarn; we are using standalone mode, and the
driver is started with deploy-mode cluster.

Thx Felix
Felix Massem | IT-Consultant | Karlsruhe
mobil: +49 (0) 172.2919848 <>

www.codecentric.de  | blog.codecentric.de 
 | www.meettheexperts.de 
 | www.more4fi.de 

Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter 
Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. beigefügter 
Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht gestattet.

> Am 29.06.2016 um 11:13 schrieb Felix Massem :
> 
> Hey Mich,
> 
> the distribution is like not given. Just right now I have 15 applications and 
> all 15 drivers are running on one node. This is just after giving all 
> machines a little more memory.
> Before I had like 15 applications and about 13 driver where running on one 
> machine. While trying to submit a new job I got OOM exceptions which took 
> down my cassandra service only to start the driver on the same node where  
> all the other 13 drivers where running.
> 
> Thx and best regards
> Felix
> 
> 
> Felix Massem | IT-Consultant | Karlsruhe
> mobil: +49 (0) 172.2919848 <>
> 
> www.codecentric.de  | blog.codecentric.de 
>  | www.meettheexperts.de 
>  | www.more4fi.de 
> 
> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
> 
> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
> bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter 
> Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. 
> beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht 
> gestattet.
> 
>> Am 28.06.2016 um 17:55 schrieb Mich Talebzadeh > >:
>> 
>> Hi Felix,
>> 
>> In Yarn-cluster mode the resource manager Yarn is expected to take care of 
>> that.
>> 
>> Are you getting some skewed distribution with drivers created through 
>> spark-submit on different nodes?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>> 
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>> 
>> 
>> On 28 June 2016 at 16:06, Felix Massem > > wrote:
>> Hey Mich,
>> 
>> thx for the fast reply.
>> 
>> We are using it in cluster mode and spark version 1.5.2
>> 
>> Greets Felix
>> 
>> 
>> Felix Massem | IT-Consultant | Karlsruhe
>> mobil: +49 (0) 172.2919848 <>
>> 
>> www.codecentric.de  | blog.codecentric.de 
>>  | www.meettheexperts.de 
>>  | www.more4fi.de 
>> 
>> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
>> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
>> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
>> 
>> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
>> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
>> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
>> bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter 
>> Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. 
>> beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht 
>> gestattet.
>> 
>>> Am 28.06.2016 um 17:04 schrieb Mich Talebzadeh 

Spark sql dataframe

2016-06-29 Thread pooja mehta
Hi,

Want to add a metadata field to StructField case class in spark.

case class StructField(name: String)

And how to carry over the metadata in query execution.


[no subject]

2016-06-29 Thread pooja mehta
Hi,

Want to add a metadata field to StructField case class in spark.

case class StructField(name: String)

And how to carry over the metadata in query execution.


Re: Running into issue using SparkIMain

2016-06-29 Thread Jayant Shekhar
Hello,

Found a workaround: install Scala and add the Scala jars to the classpath
before starting the web application.

Now it works smoothly - it just adds an extra step for users.

Would next look into making it work with the scala jar files contained in
the war.

Thx

On Mon, Jun 27, 2016 at 5:53 PM, Jayant Shekhar 
wrote:

> I tried setting the classpath explicitly in the settings. Classpath gets
> printed properly, it has the scala jars in it like
> scala-compiler-2.10.4.jar, scala-library-2.10.4.jar.
>
> It did not help. Still runs great with IntelliJ, but runs into issues when
> running from the command line.
>
> val cl = this.getClass.getClassLoader
>
> val urls = cl match {
>
>   case cl: java.net.URLClassLoader => cl.getURLs.toList
>
>   case a => sys.error("oops: I was expecting an URLClassLoader, found
> a " + a.getClass)
>
> }
>
> val classpath = urls map {_.toString}
>
> println("classpath=" + classpath);
>
> settings.classpath.value =
> classpath.distinct.mkString(java.io.File.pathSeparator)
>
> settings.embeddedDefaults(cl)
>
>
> -Jayant
>
>
> On Mon, Jun 27, 2016 at 3:19 PM, Jayant Shekhar 
> wrote:
>
>> Hello,
>>
>> I'm trying to run scala code in  a Web Application.
>>
>> It runs great when I am running it in IntelliJ
>> Run into error when I run it from the command line.
>>
>> Command used to run
>> --
>>
>> java -Dscala.usejavacp=true  -jar target/XYZ.war 
>> --spring.config.name=application,db,log4j
>> --spring.config.location=file:./conf/history
>>
>> Error
>> ---
>>
>> Failed to initialize compiler: object scala.runtime in compiler mirror
>> not found.
>>
>> ** Note that as of 2.8 scala does not assume use of the java classpath.
>>
>> ** For the old behavior pass -usejavacp to scala, or if using a Settings
>>
>> ** object programatically, settings.usejavacp.value = true.
>>
>> 16/06/27 15:12:02 WARN SparkIMain: Warning: compiler accessed before init
>> set up.  Assuming no postInit code.
>>
>>
>> I'm also setting the following:
>> 
>>
>> val settings = new Settings()
>>
>>  settings.embeddedDefaults(Thread.currentThread().getContextClassLoader())
>>
>>  settings.usejavacp.value = true
>>
>> Any pointers to the solution would be great.
>>
>> Thanks,
>> Jayant
>>
>>
>


Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Jörn Franke

I think the Tez engine is much better maintained with respect to optimizations
related to ORC, Hive, vectorization and querying than the MR engine. It will
definitely be better to use it.
MR is also deprecated in Hive 2.0.
For me it does not make sense to use MR with Hive versions later than 1.1.

As I said, order by might be inefficient to use (not sure if this has changed).
You may want to use sort by.

That being said, there are many optimization methods.
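
As a rough sketch of the difference, reusing the tables from the query quoted
below and running it through HiveContext.sql as elsewhere in this thread:
ORDER BY imposes a total order (historically funnelled through a single
reducer on MR), while SORT BY only orders rows within each reducer.

// Per-reducer ordering only; avoids the single-reducer total sort of ORDER BY
HiveContext.sql(
  """SELECT a.prod_id
    |FROM sales2 a JOIN sales_staging b ON a.prod_id = b.prod_id
    |SORT BY a.prod_id""".stripMargin)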

> On 29 Jun 2016, at 00:27, Mich Talebzadeh  wrote:
> 
> That is a good point.
> 
> The ORC table property is as follows
> 
> TBLPROPERTIES ( "orc.compress"="SNAPPY",
> "orc.stripe.size"="268435456",
> "orc.row.index.stride"="1")
> 
> which puts each stripe at 256MB
> 
> Just to clarify this is spark running on Hive tables. I don't think the use 
> of TEZ, MR or Spark as execution engines is going to make any difference?
> 
> This is the same query with Hive on MR
> 
> select a.prod_id from sales2 a, sales_staging b where a.prod_id = b.prod_id 
> order by a.prod_id;
> 
> 2016-06-28 23:23:51,203 Stage-1 map = 0%,  reduce = 0%
> 2016-06-28 23:23:59,480 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 7.32 
> sec
> 2016-06-28 23:24:08,771 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 18.21 
> sec
> 2016-06-28 23:24:11,860 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 22.34 
> sec
> 2016-06-28 23:24:18,021 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 30.33 
> sec
> 2016-06-28 23:24:21,101 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 33.45 
> sec
> 2016-06-28 23:24:24,181 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 37.5 
> sec
> 2016-06-28 23:24:27,270 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 42.0 
> sec
> 2016-06-28 23:24:30,349 Stage-1 map = 70%,  reduce = 0%, Cumulative CPU 45.62 
> sec
> 2016-06-28 23:24:33,441 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 49.69 
> sec
> 2016-06-28 23:24:36,521 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 52.92 
> sec
> 2016-06-28 23:24:39,605 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 56.78 
> sec
> 2016-06-28 23:24:42,686 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 60.36 
> sec
> 2016-06-28 23:24:45,767 Stage-1 map = 81%,  reduce = 0%, Cumulative CPU 63.68 
> sec
> 2016-06-28 23:24:48,842 Stage-1 map = 83%,  reduce = 0%, Cumulative CPU 66.92 
> sec
> 2016-06-28 23:24:51,918 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
> 70.18 sec
> 2016-06-28 23:25:52,354 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
> 127.99 sec
> 2016-06-28 23:25:57,494 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 
> 134.64 sec
> 2016-06-28 23:26:57,847 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 
> 141.01 sec
> 
> which basically sits at 67% all day
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 28 June 2016 at 23:07, Jörn Franke  wrote:
>> 
>> 
>> Bzip2 is splittable for text files.
>> 
>> Btw in Orc the question of splittable does not matter because each stripe is 
>> compressed individually.
>> 
>> Have you tried tez? As far as I recall (at least it was in the first version 
>> of Hive) mr uses for order by a single reducer which is a bottleneck.
>> 
>> Do you see some errors in the log file?
>> 
>>> On 28 Jun 2016, at 23:53, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> 
>>> I have a simple join between table sales2 a compressed (snappy) ORC with 22 
>>> million rows and another simple table sales_staging under a million rows 
>>> stored as a text file with no compression.
>>> 
>>> The join is very simple
>>> 
>>>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>>>   val s = HiveContext.table("sales_staging").select("PROD_ID")
>>> 
>>>   val rs = 
>>> s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
>>> 
>>> 
>>> Now what is happening is it is sitting on SortMergeJoin operation on 
>>> ZippedPartitionRDD as shown in the DAG diagram below
>>> 
>>> 
>>> 
>>> 
>>> 
>>> And at this rate  only 10% is done and will take for ever to finish :(
>>> 
>>> Stage 3:==> (10 + 2) / 
>>> 200]
>>> 
>>> Ok I understand that zipped files cannot be broken into blocks and 
>>> operations on them cannot be parallelized.
>>> 
>>> Having said that what are the alternatives? Never use compression and live 
>>> with it. I emphasise that any operation on the compressed table itself is 
>>> pretty fast as it is a simple table scan. However, a join between two 
>>> tables 

Re: Set the node the spark driver will be started

2016-06-29 Thread Felix Massem
Hey Mich,

the distribution is basically non-existent. Right now I have 15 applications and
all 15 drivers are running on one node, and that is after giving all machines
a little more memory.
Before that I had about 15 applications and roughly 13 drivers were running on one
machine. While trying to submit a new job I got OOM exceptions, which took down
my Cassandra service, only for the new driver to start on the same node where all
the other 13 drivers were running.

Thx and best regards
Felix


Felix Massem | IT-Consultant | Karlsruhe
mobil: +49 (0) 172.2919848 <>

www.codecentric.de  | blog.codecentric.de 
 | www.meettheexperts.de 
 | www.more4fi.de 

Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz

Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter 
Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. beigefügter 
Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht gestattet.

> Am 28.06.2016 um 17:55 schrieb Mich Talebzadeh :
> 
> Hi Felix,
> 
> In Yarn-cluster mode the resource manager Yarn is expected to take care of 
> that.
> 
> Are you getting some skewed distribution with drivers created through 
> spark-submit on different nodes?
> 
> HTH
> 
> Dr Mich Talebzadeh
> 
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
> 
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
> 
> 
> On 28 June 2016 at 16:06, Felix Massem  > wrote:
> Hey Mich,
> 
> thx for the fast reply.
> 
> We are using it in cluster mode and spark version 1.5.2
> 
> Greets Felix
> 
> 
> Felix Massem | IT-Consultant | Karlsruhe
> mobil: +49 (0) 172.2919848 <>
> 
> www.codecentric.de  | blog.codecentric.de 
>  | www.meettheexperts.de 
>  | www.more4fi.de 
> 
> Sitz der Gesellschaft: Düsseldorf | HRB 63043 | Amtsgericht Düsseldorf
> Vorstand: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
> Aufsichtsrat: Patric Fedlmeier (Vorsitzender) . Klaus Jäger . Jürgen Schütz
> 
> Diese E-Mail einschließlich evtl. beigefügter Dateien enthält vertrauliche 
> und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige 
> Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie 
> bitte sofort den Absender und löschen Sie diese E-Mail und evtl. beigefügter 
> Dateien umgehend. Das unerlaubte Kopieren, Nutzen oder Öffnen evtl. 
> beigefügter Dateien sowie die unbefugte Weitergabe dieser E-Mail ist nicht 
> gestattet.
> 
>> Am 28.06.2016 um 17:04 schrieb Mich Talebzadeh > >:
>> 
>> Hi Felix,
>> 
>> what version of Spark?
>> 
>> Are you using yarn client mode or cluster mode?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>> 
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>> 
>> 
>> On 28 June 2016 at 15:27, adaman79 > > wrote:
>> Hey guys,
>> 
>> I have a problem with memory because over 90% of my spark driver will be
>> started on one of my nine spark nodes.
>> So now I am looking for the possibility to define the node the spark driver
>> will be started when using spark-submit or setting it somewhere in the code.
>> 
>> Is this possible? Does anyone else have this kind of problem?
>> 
>> thx and best 

Job aborted due to not serializable exception

2016-06-29 Thread Paolo Patierno
Hi, 
following the socketStream[T] function implementation from the official Spark
GitHub repo:

def socketStream[T](
    hostname: String,
    port: Int,
    converter: JFunction[InputStream, java.lang.Iterable[T]],
    storageLevel: StorageLevel)
  : JavaReceiverInputDStream[T] = {
  def fn: (InputStream) => Iterator[T] = (x: InputStream) =>
    converter.call(x).iterator().asScala
  implicit val cmt: ClassTag[T] =
    implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[T]]
  ssc.socketStream(hostname, port, fn, storageLevel)
}

I'm implementing a custom receiver that works great when used in Scala.
I'm trying to use it from Java, and the createStream in MyReceiverUtils.scala is
the following:

def createStream[T](
    jssc: JavaStreamingContext,
    host: String,
    port: Int,
    address: String,
    messageConverter: Function[Message, Option[T]],
    storageLevel: StorageLevel
  ): JavaReceiverInputDStream[T] = {
  def fn: (Message) => Option[T] = (x: Message) => messageConverter.call(x)
  implicit val cmt: ClassTag[T] =
    implicitly[ClassTag[AnyRef]].asInstanceOf[ClassTag[T]]
  new MyInputDStream(jssc.ssc, host, port, address, fn, storageLevel)
}

Trying to use it I receive :

org.apache.spark.SparkException: Job aborted due to stage failure: Failed to 
serialize task 465, not attempting to retry it. Exception during serialization: 
java.io.NotSerializableException: 
org.apache.spark.streaming.amqp.JavaMyReceiverStreamSuite

If I change the fn definition to something simpler, like (x: Message) => None,
the error goes away.

Why is the call on messageConverter producing this problem?

Thanks,
Paolo

  

Driver zombie process (standalone cluster)

2016-06-29 Thread Tomer Benyamini
Hi,

I'm trying to run spark applications on a standalone cluster, running on
top of AWS. Since my slaves are spot instances, in some cases they are
being killed and lost due to bid prices. When apps are running during this
event, sometimes the spark application dies - and the driver process just
hangs, and stays up forever (zombie process), capturing memory / cpu
resources on the master machine. Then we have to manually kill -9 to free
these resources.

Has anyone seen this kind of problem before? Any suggested solution to work
around this problem?

Thanks,
Tomer


Re: Best practice for handing tables between pipeline components

2016-06-29 Thread Chanh Le
Hi Everett,
We have been using Alluxio for the last 2 months. We use Alluxio to share data
between Spark jobs, keeping Spark isolated as the processing layer and Alluxio
as the storage layer.
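
Roughly, the pattern looks like the sketch below (the Alluxio master address,
the path and the df/sqlContext handles are illustrative assumptions): one job
persists its output to Alluxio, and a later job in the pipeline reads it back
without touching the underlying storage again.

val out = "alluxio://alluxio-master:19998/pipeline/stage1_output"

// Producing job writes its result once:
df.write.parquet(out)

// A later, separate job (possibly a different SparkContext) reads it back:
val stage1 = sqlContext.read.parquet(out)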



> On Jun 29, 2016, at 2:52 AM, Everett Anderson  
> wrote:
> 
> Thanks! Alluxio looks quite promising, but also quite new.
> 
> What did people do before?
> 
> On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang  > wrote:
> Yes, Alluxio (http://www.alluxio.org/ ) can be used 
> to store data in-memory between stages in a pipeline.
> 
> Here is more information about running Spark with Alluxio: 
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html 
> 
> 
> Hope that helps,
> Gene
> 
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu 
> > wrote:
> Alluxio off heap memory would help to share cached objects
> 
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson  
> wrote:
> Hi,
> 
> We have a pipeline of components strung together via Airflow running on AWS. 
> Some of them are implemented in Spark, but some aren't. Generally they can 
> all talk to a JDBC/ODBC end point or read/write files from S3.
> 
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or 
> S3 and reading it back in, again, in every component, if it could stay cached 
> in memory in a Spark cluster. 
> 
> Our current investigation seems to lead us towards exploring if the following 
> things are possible:
> Using a Hive metastore with S3 as its backing data store to try to keep a 
> mapping from table name to files on S3 (not sure if one can cache a Hive 
> table in Spark across contexts, though)
> Using something like the spark-jobserver to keep a Spark SQLContext open 
> across Spark components so they could avoid file I/O for cached tables
> What's the best practice for handing tables between Spark programs? What 
> about between Spark and non-Spark programs?
> 
> Thanks!
> 
> - Everett
> 
> 
>