Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Yes Nirav, we could probably ask the dev list for a config parameter that
takes care of this automatically (internally). Until then, some extra care
is needed from users when specifying column names and join conditions.
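
For reference, the existing way to get a single copy of the join key is the
usingColumns variant of join (a minimal sketch; "a" is just the assumed
shared key name):

val joined  = df1.join(df2, Seq("a"))                // result keeps a single "a" column
val joined2 = df1.join(df2, Seq("a"), "left_outer")  // same, with an explicit join type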

Thanks,
Prem

On Thu, Jul 12, 2018 at 10:53 PM Nirav Patel  wrote:

> Hi Prem, dropping or renaming the column works for me as a workaround. I
> just thought it would be nice to have a generic API that handles this for
> me, or some intelligence that, since both columns are the same, doesn't
> complain in the subsequent select clause that it can't tell whether I mean
> a#12 or a#81. They are the same; just pick one.
>
> On Thu, Jul 12, 2018 at 9:38 AM, Prem Sure  wrote:
>
>> Hi Nirav, did you try
>> .drop(df1("a")) after the join?
>>
>> Thanks,
>> Prem
>>
>> On Thu, Jul 12, 2018 at 9:50 PM Nirav Patel 
>> wrote:
>>
>>> Hi Vamshi,
>>>
>>> That API is quite restricted and not generic enough. It requires every
>>> join condition to use the same column name on both sides, and it also has
>>> to be an equijoin. It doesn't serve my use case, where some join predicates
>>> don't have the same column names.
>>>
>>> Thanks
>>>
>>> On Sun, Jul 8, 2018 at 7:39 PM, Vamshi Talla 
>>> wrote:
>>>
>>>> Nirav,
>>>>
>>>> Spark does not create a duplicate column when you express the join keys
>>>> as an array of column name(s), as below, but that requires the column
>>>> name to be the same in both data frames.
>>>>
>>>> Example: df1.join(df2, ['a'])
>>>>
>>>> Thanks.
>>>> Vamshi Talla
>>>>
>>>> On Jul 6, 2018, at 4:47 PM, Gokula Krishnan D 
>>>> wrote:
>>>>
>>>> Nirav,
>>>>
>>>> The withColumnRenamed() API might help, but it does not distinguish
>>>> between the two columns; it renames every occurrence of the given column
>>>> name. Alternatively, use the select() API and rename the columns as you
>>>> want.
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>> Gokula Krishnan (Gokul)
>>>>
>>>> On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel 
>>>> wrote:
>>>>
>>>>> The expression is `df1(a) === df2(a) and df1(b) === df2(c)`
>>>>>
>>>>> How do I avoid the duplicate column 'a' in the result? I don't see any
>>>>> API that combines both. Do I have to rename manually?
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>


Re: How to avoid duplicate column names after join with multiple conditions

2018-07-12 Thread Prem Sure
Hi Nirav, did you try
.drop(df1("a")) after the join?
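
Something along these lines (an untested sketch; the column names are taken
from your join expression, and either side's "a" could be the one dropped):

val joined = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("c"))
                .drop(df1("a"))   // keep only df2's copy of "a"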

Thanks,
Prem

On Thu, Jul 12, 2018 at 9:50 PM Nirav Patel  wrote:

> Hi Vamshi,
>
> That API is quite restricted and not generic enough. It requires every
> join condition to use the same column name on both sides, and it also has
> to be an equijoin. It doesn't serve my use case, where some join predicates
> don't have the same column names.
>
> Thanks
>
> On Sun, Jul 8, 2018 at 7:39 PM, Vamshi Talla  wrote:
>
>> Nirav,
>>
>> Spark does not create a duplicate column when you express the join keys
>> as an array of column name(s), as below, but that requires the column name
>> to be the same in both data frames.
>>
>> Example: df1.join(df2, ['a'])
>>
>> Thanks.
>> Vamshi Talla
>>
>> On Jul 6, 2018, at 4:47 PM, Gokula Krishnan D 
>> wrote:
>>
>> Nirav,
>>
>> The withColumnRenamed() API might help, but it does not distinguish
>> between the two columns; it renames every occurrence of the given column
>> name. Alternatively, use the select() API and rename the columns as you
>> want.
>>
>>
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>>
>> On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel 
>> wrote:
>>
>>> The expression is `df1(a) === df2(a) and df1(b) === df2(c)`
>>>
>>> How do I avoid the duplicate column 'a' in the result? I don't see any
>>> API that combines both. Do I have to rename manually?
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>


Re: how to specify external jars in program with SparkConf

2018-07-12 Thread Prem Sure
I think the JVM is already initialized with its classpath by the time your
conf setting is applied. I faced this earlier on Spark 1.6 and ended up
moving to spark-submit with --jars;
I found extraClassPath was not a runtime-changeable config.
May I know what advantage you are trying to get by setting it programmatically?
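
If you do need something programmatic, a rough sketch that may help on the
executor side only (assumptions: the jars already exist on every node under
/usr/jars, and the jar name below is a placeholder; the driver JVM's system
classpath cannot be changed after it has started):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("app")
  // executors are launched later, so this setting can still take effect for them
  .set("spark.executor.extraClassPath", "/usr/jars/*")
  // ships the listed jar(s) and puts them on executor classpaths at runtime
  .set("spark.jars", "/usr/jars/new-version.jar")
val sc = new SparkContext(conf)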

On Thu, Jul 12, 2018 at 8:19 PM, mytramesh 
wrote:

> Context: on EMR, the classpath has an old version of a jar, and I want my
> code to use a newer version of that jar.
>
> Through a bootstrap action, while spinning up new nodes, I copied the
> necessary jars from S3 to a local folder.
>
> With the extra classpath parameters in the spark-submit command, my code is
> able to refer to the new version of the jar in the custom location.
>
> --conf="spark.driver.extraClassPath=/usr/jars/*"
> --conf="spark.executor.extraClassPath=/usr/jars/*"
>
> I want to achieve the same thing programmatically by setting it on the
> SparkConf object, but no luck. Could anyone help me with this?
>
> sparkConf.set("spark.driver.extraClassPath", "/usr/jars/*");
> sparkConf.set("spark.executor.extraClassPath", "/usr/jars/*");
> //tried below options also
> //sparkConf.set("spark.executor.userClassPathFirst", "true");
>  //sparkConf.set("spark.driver.userClassPathFirst", "true");
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Prem Sure
Try RDD.pipe() with your .py script.
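
A minimal sketch of the idea (the script path and input path are
placeholders; pipe() streams each element of a partition to the external
process on stdin, one line at a time, and reads its stdout back as an
RDD[String]):

val input = sc.textFile("hdfs:///data/input.txt")
val transformed = input.pipe("/path/to/transform.py")  // one line in, one line out
transformed.take(5).foreach(println)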

Thanks,
Prem

On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri 
wrote:

> Can someone please suggest me , thanks
>
> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, 
> wrote:
>
>> Hello Dear Spark User / Dev,
>>
>> I would like to pass a Python user-defined function to a Spark job
>> developed in Scala, and have the return value of that function handed back
>> to the DataFrame / Dataset API.
>>
>> Can someone please guide me on the best approach to do this? The Python
>> function would mostly be a transformation function. I would also like to
>> pass a Java function as a String to the Spark / Scala job, have it applied
>> to an RDD / DataFrame, and have it return an RDD / DataFrame.
>>
>> Thank you.
>>
>>
>>
>>


Re: Inferring Data driven Spark parameters

2018-07-04 Thread Prem Sure
Can you share which API your jobs use -- just core RDDs, SQL, DStreams, etc.?
Refer to the recommendations at
https://spark.apache.org/docs/2.3.0/configuration.html for detailed
configuration guidance.
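
If the goal is to derive a few configs from the data before submission, a
rough sketch of the idea (the sample path and all thresholds below are
made-up examples, not recommendations):

import scala.io.Source
import org.apache.spark.SparkConf

// cheap driver-side check on a local data sample before building the conf / submit command
val numFeatures = Source.fromFile("/local/sample/train.csv").getLines().next().split(",").length
val maxResultSize = if (numFeatures > 100) "2g" else "1g"
val shufflePartitions = math.max(100, numFeatures * 2).toString

val conf = new SparkConf()
  .set("spark.driver.maxResultSize", maxResultSize)
  .set("spark.sql.shuffle.partitions", shufflePartitions)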
Thanks,
Prem

On Wed, Jul 4, 2018 at 12:34 PM, Aakash Basu 
wrote:

> I do not want to change executor/driver cores/memory on the fly within a
> single Spark job; all I want is to make them cluster-specific. So, I want
> to have a formula with which, depending on the driver and executor
> details, I can find out the values for them before passing
> those details to spark-submit.
>
> I more or less know how to achieve the above, as I've previously done it.
>
> All I need to do is tweak the other Spark confs depending on
> the data. Is that possible? I mean (just an example), if I have 100+
> features, I want to double my default spark.driver.maxResultSize to 2G, and
> similarly for other configs. Can that be achieved by any means for an
> optimal run on that kind of dataset? If yes, how can I?
>
> On Tue, Jul 3, 2018 at 6:28 PM, Vadim Semenov  wrote:
>
>> You can't change the executor/driver cores/memory on the fly once
>> you've already started a Spark Context.
>> On Tue, Jul 3, 2018 at 4:30 AM Aakash Basu 
>> wrote:
>> >
>> > We aren't using Oozie or similar; moreover, the end-to-end job will be
>> exactly the same, but the data will be extremely different (number of
>> continuous and categorical columns, vertical size, horizontal size, etc.).
>> Hence, if there were a way to calculate the parameters, so that we could
>> simply inspect the data and derive the respective configuration/parameters,
>> it would be great.
>> >
>> > On Tue, Jul 3, 2018 at 1:09 PM, Jörn Franke 
>> wrote:
>> >>
>> >> Don't do this in your job. Create separate jobs for the different types
>> of workloads and orchestrate them using Oozie or similar.
>> >>
>> >> On 3. Jul 2018, at 09:34, Aakash Basu 
>> wrote:
>> >>
>> >> Hi,
>> >>
>> >> Cluster - 5 node (1 Driver and 4 workers)
>> >> Driver Config: 16 cores, 32 GB RAM
>> >> Worker Config: 8 cores, 16 GB RAM
>> >>
>> >> I'm using the below parameters, of which I know the first chunk is
>> cluster-dependent and the second chunk is data/code-dependent.
>> >>
>> >> --num-executors 4
>> >> --executor-cores 5
>> >> --executor-memory 10G
>> >> --driver-cores 5
>> >> --driver-memory 25G
>> >>
>> >>
>> >> --conf spark.sql.shuffle.partitions=100
>> >> --conf spark.driver.maxResultSize=2G
>> >> --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC"
>> >> --conf spark.scheduler.listenerbus.eventqueue.capacity=2
>> >>
>> >> I've arrived at these values based on my research on the properties and
>> the issues I faced, and hence these settings.
>> >>
>> >> My ask here is -
>> >>
>> >> 1) How can I infer, using some formula or code, the values for the
>> second chunk, which depends on the data/code?
>> >> 2) What other properties/configurations can I use
>> to shorten my job runtime?
>> >>
>> >> Thanks,
>> >> Aakash.
>> >
>> >
>>
>>
>> --
>> Sent from my iPhone
>>
>
>


Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread Prem Sure
Hoping the below helps clear some of this up:
Executors do not have a way to share data among themselves, except for
sharing accumulators via the driver's support.
Tasks/stages are defined based on data locality (local or remote data), and
executing them may result in a shuffle; shuffle data is exchanged between
executors directly rather than being routed through the driver.

On Wed, Jul 4, 2018 at 1:56 PM, thomas lavocat <
thomas.lavo...@univ-grenoble-alpes.fr> wrote:

> Hello,
>
> I have a question on Spark dataflow. If I understand correctly, all
> received data is sent from the executor to the driver of the application
> prior to task creation.
>
> Then the task embedding the data transits from the driver to the executor
> in order to be processed.
>
> As executors cannot exchange data themselves, in a shuffle, data also
> transits through the driver.
>
> Is that correct?
>
> Thomas
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to set spark.driver.memory?

2018-06-19 Thread Prem Sure
Hi, can you share the exception?
You need to give the value right after --driver-memory. First preference
goes to the config key/value pairs defined on the spark-submit command line,
and only then to spark-defaults.conf.
You can refer to the docs for the exact variable names.
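
For example (a sketch; the memory size, class, and jar name are placeholders):

spark-submit --driver-memory 4g --class com.example.Main your-app.jar

If the application really must be launched with plain java -jar, the driver
heap is simply that JVM's heap, so sizing it with -Xmx (e.g.
java -Xmx4g -jar your-app.jar) is the equivalent knob; setting
spark.driver.memory afterwards in code has no effect because the JVM has
already started.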

Thanks,
Prem

On Tue, Jun 19, 2018 at 5:47 PM onmstester onmstester 
wrote:

> I have a Spark cluster containing 3 nodes, and my application is a jar file
> run with java -jar.
> How can I set driver.memory for my application?
> spark-defaults.conf is only read by ./spark-submit, and
> "java --driver-memory -jar" fails with an exception.
>
> Sent using Zoho Mail 
>
>
>


Re: Spark application fail wit numRecords error

2017-11-01 Thread Prem Sure
Hi, is there any offset left over for the new topic being consumed? The case
can be that the stored offset is beyond the current latest offset, causing
the negative record count.
Also check that the Kafka brokers are healthy and up; that can sometimes be a
reason as well.

On Wed, Nov 1, 2017 at 11:40 AM, Serkan TAS  wrote:

> Hi,
>
>
>
> I searched for the error in Kafka, but I think in the end it is related to
> Spark, not Kafka.
>
>
>
> Has anyone faced an exception that terminates the program with the error
> "numRecords must not be negative" while streaming?
>
>
>
> Thanx in advance.
>
>
>
> Regards.
>
>


Re: Error using collectAsMap() in scala

2016-03-20 Thread Prem Sure
Is there any specific reason you want collectAsMap in particular? You could
probably work with the normal RDD instead of a pair RDD.
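
If you do need a Map, the RDD has to contain key/value pairs first; a minimal
sketch, assuming each CSV line is "key,value":

val dict = sc.textFile("path/to/csv/file")
  .map { line =>
    val Array(k, v) = line.split(",", 2)  // assumed two-column layout
    (k, v)
  }
val dictBroadcast = sc.broadcast(dict.collectAsMap())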


On Monday, March 21, 2016, Mark Hamstra  wrote:

> You're not getting what Ted is telling you.  Your `dict` is an RDD[String]
>  -- i.e. it is a collection of a single value type, String.  But
> `collectAsMap` is only defined for PairRDDs that have key-value pairs for
> their data elements.  Both a key and a value are needed to collect into a
> Map[K, V].
>
> On Sun, Mar 20, 2016 at 8:19 PM, Shishir Anshuman <
> shishiranshu...@gmail.com
> > wrote:
>
>> yes I have included that class in my code.
>> I guess its something to do with the RDD format. Not able to figure out
>> the exact reason.
>>
>> On Fri, Mar 18, 2016 at 9:27 AM, Ted Yu > > wrote:
>>
>>> It is defined in:
>>> core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
>>>
>>> On Thu, Mar 17, 2016 at 8:55 PM, Shishir Anshuman <
>>> shishiranshu...@gmail.com
>>> > wrote:
>>>
 I am using the following code snippet in Scala:


 val dict: RDD[String] = sc.textFile("path/to/csv/file")
 val dict_broadcast=sc.broadcast(dict.collectAsMap())

 On compiling, it generates this error:

 scala:42: value collectAsMap is not a member of
 org.apache.spark.rdd.RDD[String]

 val dict_broadcast=sc.broadcast(dict.collectAsMap())
                                      ^

>>>
>>>
>>
>


Re: Spark Certification

2016-02-11 Thread Prem Sure
I did recently. It includes MLlib & GraphX too, and I felt the exam content
covered topics only up to Spark 1.3, not the post-1.3 versions of Spark.


On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri 
wrote:

> I am planning to do that with databricks
> http://go.databricks.com/spark-certified-developer
>
> Regards,
> Janardhan
>
> On Thu, Feb 11, 2016 at 2:00 PM, Timothy Spann 
> wrote:
>
>> I was wondering that as well.
>>
>> Also is it fully updated for 1.6?
>>
>> Tim
>> http://airisdata.com/
>> http://sparkdeveloper.com/
>>
>>
>> From: naga sharathrayapati 
>> Date: Wednesday, February 10, 2016 at 11:36 PM
>> To: "user@spark.apache.org" 
>> Subject: Spark Certification
>>
>> Hello All,
>>
>> I am planning on taking the Spark Certification and I was wondering if one
>> has to be well equipped with MLlib & GraphX as well or not?
>>
>> Please advise
>>
>> Thanks
>>
>
>


Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
Try mapPartitionsWithIndex; below is an example I used earlier. The myfunc
logic can be modified further as per your need.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
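
Another option for quick inspection is glom(), which turns each partition
into an array (shown on the RDD from your example; collect() pulls everything
to the driver, so use it only on small RDDs):

val multi2s_RDD = sc.parallelize(List(2,4,6,8,10,12,14,16,18,20), 2)
multi2s_RDD.glom().collect()
// res: Array[Array[Int]] = Array(Array(2, 4, 6, 8, 10), Array(12, 14, 16, 18, 20))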

On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D 
wrote:

> Hello All -
>
> I'm just trying to understand aggregate() and in the meantime got a
> question.
>
> Is there any way to view the RDD data based on the partition?
>
> For instance, the following RDD has 2 partitions:
>
> val multi2s = List(2,4,6,8,10,12,14,16,18,20)
> val multi2s_RDD = sc.parallelize(multi2s,2)
>
> Is there any way to view the data based on the partitions (0, 1)?
>
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>


Re: How to view the RDD data based on Partition

2016-01-12 Thread Prem Sure
I had explored these examples a couple of months back; it is a very good link
for RDD operations. See if the explanation below helps, and try to understand
the difference between the two examples. The initial (zero) value in both is
the empty string "".
Example 1;
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x
+ y)
res144: String = 11

Partition 1 ("12", "23")
("","12") => "0" . here  "" is the initial value
("0","23") => "1" -- here above 0 is used again for the next element with
in the same partition

Partition 2 ("","345")
("","") => "0"   -- resulting length is 0
("0","345") => "1"  -- zero is again used and min length becomes 1

Final merge:
("1","1") => "11"

Example 2:
val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x
+ y)
res143: String = 10

Partition 1 ("12", "23")
("","12") => "0" . here  "" is the initial value
("0","23") => "1" -- here above 0 is used again for the next element with
in the same partition

Partition 2 ("345","")
("","345") => "0"   -- resulting length is 0
("0","") => "0"  -- min length becomes zero again.

Final merge:
("1","0") => "10"

Hope this helps


On Tue, Jan 12, 2016 at 2:53 PM, Gokula Krishnan D <email2...@gmail.com>
wrote:

> Hello Prem -
>
> Thanks for sharing; I also found a similar example at the link
> http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate
>
>
> But I am trying to understand the actual functionality/behavior.
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>
> On Tue, Jan 12, 2016 at 2:50 PM, Prem Sure <premsure...@gmail.com> wrote:
>
>> Try mapPartitionsWithIndex; below is an example I used earlier. The myfunc
>> logic can be modified further as per your need.
>> val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
>> def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
>>   iter.toList.map(x => index + "," + x).iterator
>> }
>> x.mapPartitionsWithIndex(myfunc).collect()
>> res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
>>
>> On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D <email2...@gmail.com>
>> wrote:
>>
>>> Hello All -
>>>
>>> I'm just trying to understand aggregate() and in the meantime got a
>>> question.
>>>
>>> Is there any way to view the RDD data based on the partition?
>>>
>>> For instance, the following RDD has 2 partitions:
>>>
>>> val multi2s = List(2,4,6,8,10,12,14,16,18,20)
>>> val multi2s_RDD = sc.parallelize(multi2s,2)
>>>
>>> Is there any way to view the data based on the partitions (0, 1)?
>>>
>>>
>>> Thanks & Regards,
>>> Gokula Krishnan (Gokul)
>>>
>>
>>
>


Re: Create a n x n graph given only the vertices no

2016-01-10 Thread Prem Sure
You mean without edge data? I don't think so. The other way around is
possible, by calling Graph.fromEdges (this assigns the vertices mentioned by
the edges a default value). Please share your need/requirement in more detail
if possible.
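
If the goal is a complete n x n graph over a known vertex set, one sketch is
to generate the edges yourself and then build the graph (the vertex/edge
attribute values here are just assumptions; self-loops are filtered out):

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize((1L to 5L).map(id => (id, s"v$id")))  // (VertexId, attr)
val ids = vertices.keys
val edges = ids.cartesian(ids)
  .filter { case (src, dst) => src != dst }      // drop self-loops
  .map { case (src, dst) => Edge(src, dst, 1) }  // edge attribute = 1
val graph = Graph(vertices, edges)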



On Sun, Jan 10, 2016 at 10:19 PM, praveen S  wrote:

> Is it possible in GraphX to create/generate a graph of n x n given only the
> vertices?
> On 8 Jan 2016 23:57, "praveen S"  wrote:
>
>> Is it possible in graphx to create/generate a graph n x n given n
>> vertices?
>>
>


Re: Spark job uses only one Worker

2016-01-08 Thread Prem Sure
To narrow it down, you can try the below:
1) Is the job going to the same node every time (when you execute the job
multiple times)? Enable the spark.speculation property, keep a Thread.sleep
of 2 minutes, and see if the job moves to a worker different from the
executor it initially landed on (trying to rule out connection or setup
related issues).
2) What is your spark.executor.memory? Try decreasing executor memory to a
value less than the data size and see if that helps in distributing the work.
3) While launching the cluster, play around with the number of slaves;
start with 1:
./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>
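
It is also worth submitting with an explicit resource request (a sketch based
on your command quoted below; the numbers and master host are placeholders):

spark/bin/spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --total-executor-cores 6 \
  --executor-memory 2g \
  --class demo.spark.StaticDataAnalysis \
  demo/Demo-1.0-SNAPSHOT-all.jar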

On Fri, Jan 8, 2016 at 2:53 PM, Michael Pisula 
wrote:

> Hi Annabel,
>
> I am using Spark in stand-alone mode (deployment using the ec2 scripts
> packaged with spark).
>
> Cheers,
> Michael
>
>
> On 08.01.2016 00:43, Annabel Melongo wrote:
>
> Michael,
>
> I don't know what's your environment but if it's Cloudera, you should be
> able to see the link to your master in the Hue.
>
> Thanks
>
>
> On Thursday, January 7, 2016 5:03 PM, Michael Pisula
>   wrote:
>
>
> I had tried several parameters, including --total-executor-cores, no
> effect.
> As for the port, I tried 7077, but if I remember correctly I got some kind
> of error that suggested to try 6066, with which it worked just fine (apart
> from this issue here).
>
> Each worker has two cores. I also tried increasing cores, again no effect.
> I was able to increase the number of cores the job was using on one worker,
> but it would not use any other worker (and it would not start if the number
> of cores the job wanted was higher than the number available on one worker).
>
> On 07.01.2016 22:51, Igor Berman wrote:
>
> Read about --total-executor-cores.
> Not sure why you specify port 6066 in the master URL; usually it's 7077.
> Verify in the master UI (usually port 8080) how many cores are there (it
> depends on other configs, but usually workers connect to the master with all
> their cores).
>
> On 7 January 2016 at 23:46, Michael Pisula 
> wrote:
>
> Hi,
>
> I start the cluster using the spark-ec2 scripts, so the cluster is in
> stand-alone mode.
> Here is how I submit my job:
> spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master
> spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar
>
> Cheers,
> Michael
>
>
> On 07.01.2016 22:41, Igor Berman wrote:
>
> share how you submit your job
> what cluster(yarn, standalone)
>
> On 7 January 2016 at 23:24, Michael Pisula < 
> michael.pis...@tngtech.com> wrote:
>
> Hi there,
>
> I ran a simple Batch Application on a Spark Cluster on EC2. Despite having
> 3
> Worker Nodes, I could not get the application processed on more than one
> node, regardless if I submitted the Application in Cluster or Client mode.
> I also tried manually increasing the number of partitions in the code, no
> effect. I also pass the master into the application.
> I verified on the nodes themselves that only one node was active while the
> job was running.
> I pass enough data to make the job take 6 minutes to process.
> The job is simple enough, reading data from two S3 files, joining records
> on
> a shared field, filtering out some records and writing the result back to
> S3.
>
> Tried all kinds of stuff, but could not make it work. I did find similar
> questions, but had already tried the solutions that worked in those cases.
> Would be really happy about any pointers.
>
> Cheers,
> Michael
>
>
>
> --
> View this message in context:
> 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: 
> user-unsubscr...@spark.apache.org
> For additional commands, e-mail: 
> user-h...@spark.apache.org
>
>
>
> --
> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>
>
> --
> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>
>
>
> --
> Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>


Re: adding jars - hive on spark cdh 5.4.3

2016-01-07 Thread Prem Sure
Did you try the --jars property in spark-submit? If your jar is of huge size,
you can pre-load the jar on all executors in a commonly available directory
to avoid the network IO.
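
For example (a sketch; the paths, class, and jar names are placeholders, and
for Hive-on-Spark the jar may also need to be visible to the Hive side, not
just to spark-submit):

spark-submit --jars /opt/libs/extra-udfs.jar,/opt/libs/other.jar \
  --class com.example.Main app.jar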

On Thu, Jan 7, 2016 at 4:03 PM, Ophir Etzion  wrote:

> I'm trying to add jars before running a query using Hive on Spark on CDH
> 5.4.3.
> I've tried applying the patch in
> https://issues.apache.org/jira/browse/HIVE-12045 (manually, as the patch
> is for a different Hive version) but still haven't succeeded.
>
> did anyone manage to do ADD JAR successfully with CDH?
>
> Thanks,
> Ophir
>


Re: Problems with reading data from parquet files in a HDFS remotely

2016-01-07 Thread Prem Sure
You may need to add a createDataFrame() call (for Python, with inferSchema)
before registerTempTable.

Thanks,

Prem


On Thu, Jan 7, 2016 at 12:53 PM, Henrik Baastrup <
henrik.baast...@netscout.com> wrote:

> Hi All,
>
> I have a small Hadoop cluster where I have stored a lot of data in parquet
> files. I have installed a Spark master service on one of the nodes and now
> would like to query my parquet files from a Spark client. When I run the
> following program from the spark-shell on the Spark master node, everything
> works correctly:
>
> # val sqlCont = new org.apache.spark.sql.SQLContext(sc)
> # val reader = sqlCont.read
> # val dataFrame = reader.parquet("/user/hdfs/parquet-multi/BICC")
> # dataFrame.registerTempTable("BICC")
> # val recSet = sqlCont.sql("SELECT 
> protocolCode,beginTime,endTime,called,calling FROM BICC WHERE 
> endTime>=14494218 AND endTime<=14494224 AND 
> calling='6287870642893' AND p_endtime=14494224")
> # recSet.show()
>
> But when I run the Java program below, from my client, I get:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: No 
> predefined schema found, and no Parquet data files or summary files found 
> under file:/user/hdfs/parquet-multi/BICC.
>
> The exception occurs at the line: DataFrame df = 
> reader.parquet("/user/hdfs/parquet-multi/BICC");
>
> On the Master node I can see the client connect when the SparkContext is 
> instanced, as I get the following lines in the Spark log:
>
> 16/01/07 18:27:47 INFO Master: Registering app SparkTest
> 16/01/07 18:27:47 INFO Master: Registered app SparkTest with ID 
> app-20160107182747-00801
>
> If I create a local directory with the given path, my program goes in an 
> endless loop, with the following warning on the console:
>
> WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; 
> check your cluster UI to ensure that workers are registered and have 
> sufficient resources
>
> To me it seems that my SQLContext does not connect to the Spark master, but
> tries to work locally on the client, where the requested files do not exist.
>
> Java program:
>   SparkConf conf = new SparkConf()
>   .setAppName("SparkTest")
>   .setMaster("spark://172.27.13.57:7077");
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
>   
>   DataFrameReader reader = sqlContext.read();
>   DataFrame df = reader.parquet("/user/hdfs/parquet-multi/BICC");
>   DataFrame filtered = df.filter("endTime>=14494218 AND 
> endTime<=14494224 AND calling='6287870642893' AND 
> p_endtime=14494224");
>   filtered.show();
>
> Is there someone who can help me?
>
> Henrik
>
>
>


Re: Question in rdd caching in memory using persist

2016-01-07 Thread Prem Sure
Are you running in standalone local mode or in cluster mode? Executor and
driver existence differs based on the setup type. A snapshot of your
environment UI would help in saying more.

On Thu, Jan 7, 2016 at 11:51 AM,  wrote:

> Hi,
>
>
>
> After I called rdd.persist(MEMORY_ONLY_SER), I see the driver listed as one
> of the 'executors' participating in holding the partitions of the RDD in
> memory, though the memory usage shown against the driver is 0. I see this in
> the Storage tab of the Spark UI.
>
> Why is the driver shown in the UI? Will it ever hold RDD partitions when
> caching?
>
>
>
>
>
> *-regards*
>
> *Seemanto Barua*
>
>
>
>
>
>


Re: sparkR ORC support.

2016-01-05 Thread Prem Sure
Yes Sandeep; also copy hive-site.xml to the Spark conf directory.


On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
wrote:

> Also, do I need to setup hive in spark as per the link
> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
> ?
>
> We might need to copy hdfs-site.xml file to spark conf directory ?
>
> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
> wrote:
>
>> Deepak
>>
>> Tried this. Getting this error now
>>
>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
>>   unused argument ("")
>>
>>
>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> can you try this ?
>>>
>>> results <- sql(hivecontext, "FROM test SELECT id","")
>>>
>>> Thanks
>>> Deepak
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
>>> wrote:
>>>
 Thanks Deepak.

 I tried this as well. I created a hivecontext   with  "hivecontext <<-
 sparkRHive.init(sc) "  .

 When I tried to read hive table from this ,

 results <- sql(hivecontext, "FROM test SELECT id")

 I get below error,

 Error in callJMethod(sqlContext, "sql", sqlQuery) :
   Invalid jobj 2. If SparkR was restarted, Spark operations need to be 
 re-executed.


 Not sure what is causing this? Any leads or ideas? I am using rstudio.



 On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
 wrote:

> Hi Sandeep
> I am not sure if ORC can be read directly in R.
> But there can be a workaround .First create hive table on top of ORC
> files and then access hive table in R.
>
> Thanks
> Deepak
>
> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
> wrote:
>
>> Hello
>>
>> I need to read an ORC files in hdfs in R using spark. I am not able
>> to find a package to do that.
>>
>> Can anyone help with documentation or example for this purpose?
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Architect
 Infoworks.io
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io
>> http://Infoworks.io
>>
>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Exception in thread "main" java.lang.IncompatibleClassChangeError:

2015-12-04 Thread Prem Sure
Getting the below exception while executing the program below in Eclipse.
Any clue on what's wrong here would be helpful.

public class WordCount {

    private static final FlatMapFunction<String, String> WORDS_EXTRACTOR =
        new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" "));
            }
        };

    private static final PairFunction<String, String, Integer> WORDS_MAPPER =
        new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        };

    private static final Function2<Integer, Integer, Integer> WORDS_REDUCER =
        new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) throws Exception {
                return a + b;
            }
        };

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark.WordCount").setMaster("local");
        JavaSparkContext context = new JavaSparkContext(conf);

        JavaRDD<String> file = context.textFile("Input/SampleTextFile.txt");
        file.saveAsTextFile("file:///Output/WordCount.txt");

        JavaRDD<String> words = file.flatMap(WORDS_EXTRACTOR);
        JavaPairRDD<String, Integer> pairs = words.mapToPair(WORDS_MAPPER);
        JavaPairRDD<String, Integer> counter = pairs.reduceByKey(WORDS_REDUCER);

        counter.foreach(System.out::println);
        counter.saveAsTextFile("file:///Output/WordCount.txt");
    }
}

Exception in thread "main" java.lang.IncompatibleClassChangeError: Implementing class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.firstAvailableClass(SparkHadoopMapRedUtil.scala:61)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$class.newJobContext(SparkHadoopMapRedUtil.scala:27)
at org.apache.spark.SparkHadoopWriter.newJobContext(SparkHadoopWriter.scala:39)
at org.apache.spark.SparkHadoopWriter.getJobContext(SparkHadoopWriter.scala:149)
at org.apache.spark.SparkHadoopWriter.preSetup(SparkHadoopWriter.scala:63)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1045)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:940)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:849)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1164)
at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:443)
at org.apache.spark.api.java.JavaRDD.saveAsTextFile(JavaRDD.scala:32)
at spark.WordCount.main(WordCount.java:44)


Re: Automatic driver restart does not seem to be working in Spark Standalone

2015-11-25 Thread Prem Sure
I think automatic driver restart happens only if the driver fails with a
non-zero exit code, and only when the job is submitted with:

  --deploy-mode cluster
  --supervise
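
For example (a sketch; the master URL, class, and jar are placeholders):

spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingJob \
  your-app.jar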



On Wed, Nov 25, 2015 at 1:46 PM, SRK  wrote:

> Hi,
>
> I am submitting my Spark job with the supervise option as shown below. When
> I kill the driver and the app from the UI, the driver does not restart
> automatically. This is in cluster mode. Any suggestion on how to make
> automatic driver restart work would be of great help.
>
> --supervise
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Automatic-driver-restart-does-not-seem-to-be-working-in-Spark-Standalone-tp25478.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Queue in Spark standalone mode

2015-11-25 Thread Prem Sure
In Spark standalone mode, submitted applications run in FIFO
(first-in, first-out) order. Please elaborate on the "strange behavior while
running multiple jobs simultaneously".
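
There are no YARN-style queues in standalone mode, but you can cap what each
application grabs so several can run concurrently (a sketch; the numbers,
class, and jar are placeholders):

spark-submit --conf spark.cores.max=8 --conf spark.executor.memory=4g \
  --class com.example.Main app.jar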

On Wed, Nov 25, 2015 at 2:29 PM, sunil m <260885smanik...@gmail.com> wrote:

> Hi!
>
> I am using Spark 1.5.1 and pretty new to Spark...
>
> Like Yarn, is there a way to configure queues in Spark standalone mode?
> If yes, can someone point me to a good documentation / reference.
>
> Sometimes  I get strange behavior while running  multiple jobs
> simultaneously.
>
> Thanks in advance.
>
> Warm regards,
> Sunil
>


Re: Spark 1.6 Build

2015-11-24 Thread Prem Sure
You can refer to:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
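
The basic command from that page is below (the profile flags in the second
line are examples; adjust them to the Hadoop version you need):

./build/mvn -DskipTests clean package
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package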


On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I'm not able to build Spark 1.6 from source. Could you please share the
> steps to build Spark 1.6?
>
> Regards,
> Rajesh
>