Re: Application not showing in Spark History

2016-08-02 Thread Sun Rui
bin/spark-submit sets some environment variables, like SPARK_HOME, that Spark later 
uses to locate spark-defaults.conf, from which default settings for 
Spark are loaded.

I would guess that some configuration option, like spark.eventLog.enabled in 
spark-defaults.conf, is skipped when the SparkSubmit class is used directly 
instead of “bin/spark-submit”.

The formal way to launch a Spark application from within Java is to use 
SparkLauncher. Remember to call SparkLauncher.setSparkHome() to set the Spark 
home directory.
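
For reference, a minimal SparkLauncher sketch (shown here in Scala; the Spark home, jar path, main class and the event-log setting are illustrative assumptions, not taken from this thread):

import org.apache.spark.launcher.SparkLauncher

object LaunchMyApp {
  def main(args: Array[String]): Unit = {
    // Hypothetical paths and class names, shown only to illustrate the API.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")                  // lets spark-defaults.conf be picked up
      .setAppResource("/path/to/my-spark-app.jar")
      .setMainClass("com.example.MySparkApp")
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.eventLog.enabled", "true")   // or rely on spark-defaults.conf
      .launch()                                    // returns a java.lang.Process
    process.waitFor()
  }
}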

> On Aug 2, 2016, at 16:53, Rychnovsky, Dusan wrote:
> 
> Hi,
> 
> I am trying to launch my Spark application from within my Java application 
> via the SparkSubmit class, like this:
> 
> 
> List<String> args = new ArrayList<>();
> 
> args.add("--verbose");
> args.add("--deploy-mode=cluster");
> args.add("--master=yarn");
> ...
> 
> SparkSubmit.main(args.toArray(new String[args.size()]));
> 
> 
> This works fine, with one catch - the application does not appear in Spark 
> History after it's finished.
> 
> If, however, I run the application using `spark-submit.sh`, like this:
> 
> 
> spark-submit \
>   --verbose \
>   --deploy-mode=cluster \
>   --master=yarn \
>   ...
> 
> 
> the application appears in Spark History correctly.
> 
> What am I missing?
> 
> Also, is this a good way to launch a Spark application from within a Java 
> application or is there a better way?
> 
> Thanks,
> Dusan



Re: The equivalent for INSTR in Spark FP

2016-08-02 Thread Mich Talebzadeh
Thanks Jared for your kind words. I don't think I am anywhere near there
yet :)

In general I subtract one character before getting to "CD". That is the way
debits from debit cards are marked in a bank's statement.

I get an out-of-bound error:
select(mySubstr($"transactiondescription",lit(0),instr($"transactiondescription",
"CD")-1)) fails on the length. So I did

ll_18740868.where($"transactiontype" === "DEB" && $"transactiondescription"
> "
").select(mySubstr($"transactiondescription",lit(0),instr($"transactiondescription",
"CD")-1),$"debitamount").collect.foreach(println)

which basically checks whether the value of $"transactiondescription" > "  "
and only then does the substring.

Now, are there better options than that, say making the UDF handle the error
when length($"transactiondescription") < 2 or it is null etc. and return
something, to avoid the program crashing?

Thanks again for your help.
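
For reference, one defensive variant is a UDF that guards against null or short input itself; a sketch (returning null as the fallback is an assumption, not something settled in this thread):

import org.apache.spark.sql.functions.{udf, lit, instr}

// Return null instead of throwing when the string is null, too short,
// or the computed end position is unusable (e.g. "CD" not found).
val mySafeSubstr = udf { (s: String, start: Int, end: Int) =>
  if (s == null || s.length < 2) null
  else {
    val to = math.min(end, s.length)
    if (to <= start) null else s.substring(start, to)
  }
}

ll_18740868.select(mySafeSubstr($"transactiondescription", lit(0),
  instr($"transactiondescription", "CD") - 1), $"debitamount")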





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 10:57, Jacek Laskowski  wrote:

> Congrats! You made it. A serious Spark dev badge unlocked :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Aug 2, 2016 at 9:58 AM, Mich Talebzadeh
>  wrote:
> > it should be lit(0) :)
> >
> > rs.select(mySubstr($"transactiondescription", lit(0),
> > instr($"transactiondescription", "CD"))).show(1)
> > +--+
> > |UDF(transactiondescription,0,instr(transactiondescription,CD))|
> > +--+
> > |  OVERSEAS TRANSACTI C|
> > +--+
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 2 August 2016 at 08:52, Mich Talebzadeh 
> > wrote:
> >>
> >> No thinking on my part!!!
> >>
> >> rs.select(mySubstr($"transactiondescription", lit(1),
> >> instr($"transactiondescription", "CD"))).show(2)
> >> +--+
> >> |UDF(transactiondescription,1,instr(transactiondescription,CD))|
> >> +--+
> >> |   VERSEAS TRANSACTI C|
> >> |   XYZ.COM 80...|
> >> +--+
> >> only showing top 2 rows
> >>
> >> Let me test it.
> >>
> >> Cheers
> >>
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn
> >>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly
> disclaimed. The
> >> author will in no case be liable for any monetary damages arising from
> such
> >> loss, damage or destruction.
> >>
> >>
> >>
> >>
> >> On 1 August 2016 at 23:43, Mich Talebzadeh 
> >> wrote:
> >>>
> >>> Thanks Jacek.
> >>>
> >>> It sounds like the issue is the position of the second variable in
> >>> substring().
> >>>
> >>> This works
> >>>
> >>> scala> val wSpec2 =
> >>> Window.partitionBy(substring($"transactiondescription",1,20))
> >>> wSpec2: org.apache.spark.sql.expressions.WindowSpec =
> >>> org.apache.spark.sql.expressions.WindowSpec@1a4eae2
> >>>
> >>> Using udf as suggested
> >>>
> >>> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
> >>>  |  s.substring(start, end) }
> >>> mySubstr: org.apache.spark.sql.UserDefinedFunction =
> >>> UserDefinedFunction(,StringType,List(StringType,

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Mich Talebzadeh
I don't think it really works, and it is vague. Is it rows, blocks, or network?



[image: Inline images 1]

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 12:09, Chanh Le  wrote:

> Hi Ayan,
> You mean
> common.max_count = 1000
> Max number of SQL result to display to prevent the browser overload.
> This is common properties for all connections
>
>
>
>
> It is already set by default in Zeppelin, but I think it doesn’t work with Hive.
>
>
> DOC: http://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/interpreter/jdbc.html
>
>
> On Aug 2, 2016, at 6:03 PM, ayan guha  wrote:
>
> Zeppelin already has a param for jdbc
> On 2 Aug 2016 19:50, "Mich Talebzadeh"  wrote:
>
>> Ok I have already set up mine
>>
>> 
>> <property>
>>   <name>hive.limit.optimize.fetch.max</name>
>>   <value>50000</value>
>>   <description>
>>     Maximum number of rows allowed for a smaller subset of data for
>> simple LIMIT, if it is a fetch query.
>>     Insert queries are not restricted by this limit.
>>   </description>
>> </property>
>>
>> I am surprised that yours was missing. What did you set it up to?
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 10:18, Chanh Le  wrote:
>>
>>> I tried and it works perfectly.
>>>
>>> Regards,
>>> Chanh
>>>
>>>
>>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> OK
>>>
>>> Try that
>>>
>>> Another tedious way is to create views in Hive based on tables and use
>>> limit on those views.
>>>
>>> But try that parameter first if it does anything.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 August 2016 at 09:13, Chanh Le  wrote:
>>>
 Hi Mich,
 I use Spark Thrift Server basically it acts like Hive.

 I see that there is property in Hive.

 hive.limit.optimize.fetch.max

- Default Value: 50000
- Added In: Hive 0.8.0

 Maximum number of rows allowed for a smaller subset of data for simple
 LIMIT, if it is a fetch query. Insert queries are not restricted by this
 limit.


 Is that related to the problem?




 On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh 
 wrote:

 This is a classic problem on any RDBMS

 Set the limit on the number of rows returned like maximum of 50K rows
 through JDBC

 What is your JDBC connection going to? Meaning which RDBMS if any?

 HTH

 Dr Mich Talebzadeh


 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *


 http://talebzadehmich.wordpress.com

 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 2 August 2016 at 08:41, Chanh Le  wrote:

> Hi everyone,
> I setup STS and use Zeppelin to 

Re: Testing --supervise flag

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Widening to dev@spark

On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M  wrote:
>
> Hi all,
>
> I was trying to test --supervise flag of spark-submit.
>
> The documentation [1] says that, the flag helps in restarting your
> application automatically if it exited with non-zero exit code.
>
> I am looking for some clarification on that documentation. In this
> context, does application mean the driver?
>
> Will the driver be re-launched if an exception is thrown by the
> application? I tested this scenario and the driver is not re-launched.
>
> ~/spark-1.6.1/bin/spark-submit --deploy-mode cluster --master 
> spark://10.29.83.162:6066 --class 
> org.apache.spark.examples.ExceptionHandlingTest 
> /home/spark/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
> I killed the driver java process using 'kill -9' command and the driver
> is re-launched.
>
> Is this the only scenario where the driver will be re-launched? Is there a
> way to simulate a non-zero exit code and test the use of the --supervise flag?
>
> Regards,
> Noorul
>
> [1] 
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
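
For what it is worth, one way to exercise that path would be a toy driver that deliberately exits with a non-zero status; a sketch (the class and app name are made up, and whether --supervise restarts the driver on sys.exit is exactly what such a test would check):

import org.apache.spark.{SparkConf, SparkContext}

// Toy driver: do a little work, then exit non-zero so the standalone
// master's --supervise behaviour can be observed.
object NonZeroExitTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NonZeroExitTest"))
    sc.parallelize(1 to 100).count()
    sc.stop()
    sys.exit(1)   // simulate an abnormal driver exit
  }
}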

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
I tried and it works perfectly.

Regards,
Chanh


> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh  wrote:
> 
> OK
> 
> Try that
> 
> Another tedious way is to create views in Hive based on tables and use limit 
> on those views.
> 
> But try that parameter first if it does anything.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 09:13, Chanh Le  > wrote:
> Hi Mich,
> I use Spark Thrift Server basically it acts like Hive.
> 
> I see that there is property in Hive.
> 
>> hive.limit.optimize.fetch.max
>> Default Value: 50000
>> Added In: Hive 0.8.0
>> Maximum number of rows allowed for a smaller subset of data for simple 
>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>> limit.
> 
> Is that related to the problem?
> 
> 
> 
> 
>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh > > wrote:
>> 
>> This is a classic problem on any RDBMS
>> 
>> Set the limit on the number of rows returned like maximum of 50K rows 
>> through JDBC
>> 
>> What is your JDBC connection going to? Meaning which RDBMS if any?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 08:41, Chanh Le > > wrote:
>> Hi everyone,
>> I setup STS and use Zeppelin to query data through JDBC connection.
>> A problem we are facing is that users usually forget to put a limit in the query, 
>> so it causes the cluster to hang.
>> 
>> SELECT * FROM tableA;
>> 
>> Is there anyway to config the limit by default ?
>> 
>> 
>> Regards,
>> Chanh
>> 
> 
> 



Re: sql to spark scala rdd

2016-08-02 Thread Sri
Make sense thanks.

Thanks
Sri

Sent from my iPhone

> On 2 Aug 2016, at 03:27, Jacek Laskowski  wrote:
> 
> Congrats!
> 
> Whenever I was doing foreach(println) in the past I'm .toDF.show these
> days. Give it a shot and you'll experience the feeling yourself! :)
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Tue, Aug 2, 2016 at 4:25 AM, sri hari kali charan Tummala
>  wrote:
>> Hi All,
>> 
>> The code below calculates a cumulative sum (running sum) and moving average using
>> Scala RDD-style programming. I was using the wrong function (sliding);
>> scanLeft should be used instead.
>> 
>> 
>> sc.textFile("C:\\Users\\kalit_000\\Desktop\\Hadoop_IMP_DOC\\spark\\data.txt")
>>  .map(x => x.split("\\~"))
>>  .map(x => (x(0), x(1), x(2).toDouble))
>>  .groupBy(_._1)
>>  .mapValues{(x => x.toList.sortBy(_._2).zip(Stream from
>> 1).scanLeft(("","",0.0,0.0,0.0,0.0))
>>  { (a,b) => (b._1._1,b._1._2,b._1._3,(b._1._3.toDouble +
>> a._3.toDouble),(b._1._3.toDouble + a._3.toDouble)/b._2,b._2)}.tail)}
>>  .flatMapValues(x => x.sortBy(_._1))
>>  .foreach(println)
>> 
>> Input Data:-
>> 
>> Headers:-
>> Key,Date,balance
>> 
>> 786~20160710~234
>> 786~20160709~-128
>> 786~20160711~-457
>> 987~20160812~456
>> 987~20160812~567
>> 
>> Output Data:-
>> 
>> Column Headers:-
>> key, (key,Date,balance , daily balance, running average , row_number based
>> on key)
>> 
>> (786,(786,20160709,-128.0,-128.0,-128.0,1.0))
>> (786,(786,20160710,234.0,106.0,53.0,2.0))
>> (786,(786,20160711,-457.0,-223.0,-74.33,3.0))
>> 
>> (987,(987,20160812,567.0,1023.0,511.5,2.0))
>> (987,(987,20160812,456.0,456.0,456.0,1.0))
>> 
>> Reference:-
>> 
>> https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
>> 
>> 
>> Thanks
>> Sri
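
The same running balance can also be expressed with the DataFrame window API; a rough sketch (assuming a DataFrame df with columns key, date and balance matching the input above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, per key, ordered by date
val w = Window.partitionBy("key").orderBy("date").rowsBetween(Long.MinValue, 0)

val result = df
  .withColumn("daily_balance", sum("balance").over(w))
  .withColumn("running_avg", avg("balance").over(w))
  .withColumn("row_number", row_number().over(Window.partitionBy("key").orderBy("date")))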
>> 
>> 
>>> On Mon, Aug 1, 2016 at 12:07 AM, Sri  wrote:
>>> 
>>> Hi ,
>>> 
>>> I solved it using spark SQL which uses similar window functions mentioned
>>> below , for my own knowledge I am trying to solve using Scala RDD which I am
>>> unable to.
>>> What function in Scala supports window function like SQL unbounded
>>> preceding and current row ? Is it sliding ?
>>> 
>>> 
>>> Thanks
>>> Sri
>>> 
>>> Sent from my iPhone
>>> 
>>> On 31 Jul 2016, at 23:16, Mich Talebzadeh 
>>> wrote:
>>> 
>>> hi
>>> 
>>> You mentioned:
>>> 
>>> I already solved it using DF and spark sql ...
>>> 
>>> Are you referring to this code which is a classic analytics:
>>> 
>>> SELECT DATE,balance,
>>> 
>>> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>>> 
>>> AND
>>> 
>>> CURRENT ROW) daily_balance
>>> 
>>> FROM  table
>>> 
>>> 
>>> So how did you solve it using DF in the first place?
>>> 
>>> 
>>> HTH
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>> 
>>> 
>>> 
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> 
>>> 
>>> 
>>> http://talebzadehmich.wordpress.com
>>> 
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>> 
>>> 
>>> 
>>> 
 On 1 August 2016 at 07:04, Sri  wrote:
 
 Hi ,
 
 Just wondering how spark SQL works behind the scenes does it not convert
 SQL to some Scala RDD ? Or Scala ?
 
 How to write below SQL in Scala or Scala RDD
 
 SELECT DATE,balance,
 
 SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
 
 AND
 
 CURRENT ROW) daily_balance
 
 FROM  table
 
 
 Thanks
 Sri
 Sent from my iPhone
 
 On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
 
 Hi,
 
 Impossible - see
 
 http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr].
 
 I tried to show you why you ended up with "non-empty iterator" after
 println. You should really start with
 http://www.scala-lang.org/documentation/
 
 Pozdrawiam,
 Jacek Laskowski
 
 https://medium.com/@jaceklaskowski/
 Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
 Follow me at https://twitter.com/jaceklaskowski
 
 
 On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
  wrote:
 
 Tuple
 
 
 [Lscala.Tuple2;@65e4cb84
 
 
 On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
 
 
 Hi,
 
 
 What's the result type of sliding(2,1)?
 
 
 Pozdrawiam,
 
 Jacek 

Re: sql to spark scala rdd

2016-08-02 Thread Jacek Laskowski
Congrats!

Whenever I was doing foreach(println) in the past I'm .toDF.show these
days. Give it a shot and you'll experience the feeling yourself! :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 2, 2016 at 4:25 AM, sri hari kali charan Tummala
 wrote:
> Hi All,
>
> The code below calculates a cumulative sum (running sum) and moving average using
> Scala RDD-style programming. I was using the wrong function (sliding);
> scanLeft should be used instead.
>
>
> sc.textFile("C:\\Users\\kalit_000\\Desktop\\Hadoop_IMP_DOC\\spark\\data.txt")
>   .map(x => x.split("\\~"))
>   .map(x => (x(0), x(1), x(2).toDouble))
>   .groupBy(_._1)
>   .mapValues{(x => x.toList.sortBy(_._2).zip(Stream from
> 1).scanLeft(("","",0.0,0.0,0.0,0.0))
>   { (a,b) => (b._1._1,b._1._2,b._1._3,(b._1._3.toDouble +
> a._3.toDouble),(b._1._3.toDouble + a._3.toDouble)/b._2,b._2)}.tail)}
>   .flatMapValues(x => x.sortBy(_._1))
>   .foreach(println)
>
> Input Data:-
>
> Headers:-
> Key,Date,balance
>
> 786~20160710~234
> 786~20160709~-128
> 786~20160711~-457
> 987~20160812~456
> 987~20160812~567
>
> Output Data:-
>
> Column Headers:-
> key, (key,Date,balance , daily balance, running average , row_number based
> on key)
>
> (786,(786,20160709,-128.0,-128.0,-128.0,1.0))
> (786,(786,20160710,234.0,106.0,53.0,2.0))
> (786,(786,20160711,-457.0,-223.0,-74.33,3.0))
>
> (987,(987,20160812,567.0,1023.0,511.5,2.0))
> (987,(987,20160812,456.0,456.0,456.0,1.0))
>
> Reference:-
>
> https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/
>
>
> Thanks
> Sri
>
>
> On Mon, Aug 1, 2016 at 12:07 AM, Sri  wrote:
>>
>> Hi ,
>>
>> I solved it using spark SQL which uses similar window functions mentioned
>> below , for my own knowledge I am trying to solve using Scala RDD which I am
>> unable to.
>> What function in Scala supports window function like SQL unbounded
>> preceding and current row ? Is it sliding ?
>>
>>
>> Thanks
>> Sri
>>
>> Sent from my iPhone
>>
>> On 31 Jul 2016, at 23:16, Mich Talebzadeh 
>> wrote:
>>
>> hi
>>
>> You mentioned:
>>
>> I already solved it using DF and spark sql ...
>>
>> Are you referring to this code which is a classic analytics:
>>
>> SELECT DATE,balance,
>>
>>  SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>>
>>  AND
>>
>>  CURRENT ROW) daily_balance
>>
>>  FROM  table
>>
>>
>> So how did you solve it using DF in the first place?
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 1 August 2016 at 07:04, Sri  wrote:
>>>
>>> Hi ,
>>>
>>> Just wondering how spark SQL works behind the scenes does it not convert
>>> SQL to some Scala RDD ? Or Scala ?
>>>
>>> How to write below SQL in Scala or Scala RDD
>>>
>>> SELECT DATE,balance,
>>>
>>> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>>>
>>> AND
>>>
>>> CURRENT ROW) daily_balance
>>>
>>> FROM  table
>>>
>>>
>>> Thanks
>>> Sri
>>> Sent from my iPhone
>>>
>>> On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
>>>
>>> Hi,
>>>
>>> Impossible - see
>>>
>>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr].
>>>
>>> I tried to show you why you ended up with "non-empty iterator" after
>>> println. You should really start with
>>> http://www.scala-lang.org/documentation/
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
>>>  wrote:
>>>
>>> Tuple
>>>
>>>
>>> [Lscala.Tuple2;@65e4cb84
>>>
>>>
>>> On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
>>>
>>>
>>> Hi,
>>>
>>>
>>> What's the result type of sliding(2,1)?
>>>
>>>
>>> Pozdrawiam,
>>>
>>> Jacek Laskowski
>>>
>>> 
>>>
>>> https://medium.com/@jaceklaskowski/
>>>
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>>
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>>
>>> On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala
>>>
>>>  wrote:
>>>
>>> tried this, no luck. What is the non-empty iterator here?
>>>
>>>
>>> OP:-
>>>
>>> 

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi Ayan, 
You mean 
common.max_count = 1000
Max number of SQL result to display to prevent the browser overload. This is 
common properties for all connections




It is already set by default in Zeppelin, but I think it doesn’t work with Hive.


DOC: http://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/interpreter/jdbc.html 



> On Aug 2, 2016, at 6:03 PM, ayan guha  wrote:
> 
> Zeppelin already has a param for jdbc
> 
> On 2 Aug 2016 19:50, "Mich Talebzadeh"  > wrote:
> Ok I have already set up mine
> 
> 
> <property>
>   <name>hive.limit.optimize.fetch.max</name>
>   <value>50000</value>
>   <description>
>     Maximum number of rows allowed for a smaller subset of data for simple
> LIMIT, if it is a fetch query.
>     Insert queries are not restricted by this limit.
>   </description>
> </property>
> 
> I am surprised that yours was missing. What did you set it up to?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 10:18, Chanh Le  > wrote:
> I tried and it works perfectly.
> 
> Regards,
> Chanh
> 
> 
>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh > > wrote:
>> 
>> OK
>> 
>> Try that
>> 
>> Another tedious way is to create views in Hive based on tables and use limit 
>> on those views.
>> 
>> But try that parameter first if it does anything.
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 09:13, Chanh Le > > wrote:
>> Hi Mich,
>> I use Spark Thrift Server basically it acts like Hive.
>> 
>> I see that there is property in Hive.
>> 
>>> hive.limit.optimize.fetch.max
>>> Default Value: 50000
>>> Added In: Hive 0.8.0
>>> Maximum number of rows allowed for a smaller subset of data for simple 
>>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>>> limit.
>> 
>> Is that related to the problem?
>> 
>> 
>> 
>> 
>>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh >> > wrote:
>>> 
>>> This is a classic problem on any RDBMS
>>> 
>>> Set the limit on the number of rows returned like maximum of 50K rows 
>>> through JDBC
>>> 
>>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> On 2 August 2016 at 08:41, Chanh Le >> > wrote:
>>> Hi everyone,
>>> I setup STS and use Zeppelin to query data through JDBC connection.
>>> A problem we are facing is that users usually forget to put a limit in the query, 
>>> so it causes the cluster to hang.
>>> 
>>> SELECT * FROM tableA;
>>> 
>>> Is there anyway to config the limit by default ?
>>> 
>>> 
>>> Regards,
>>> Chanh
>>> 
>> 
>> 
> 
> 



Re: java.net.UnknownHostException

2016-08-02 Thread Yang Cao
Actually, I just ran into the same problem. Could you share some code around 
the error, so I can figure out whether I can help you? And is 
"s001.bigdata" the name of your name node?
> On 2016年8月2日, at 17:22, pseudo oduesp  wrote:
> 
> Can someone help me, please?
> 
> 2016-08-01 11:51 GMT+02:00 pseudo oduesp  >:
> Hi,
> I get the following errors when I try using PySpark 2.0 with IPython on 
> YARN.
> Can someone help me, please?
> java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> s001.bigdata.;s003.bigdata;s008bigdata.
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:823)
> at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:779)
> at 
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
> at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:133)
> at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:130)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
> at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:130)
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:367)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> java.net.UnknownHostException:s001.bigdata.;s003.bigdata;s008bigdata.
> 
> 
> thanks 
> 



Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
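
If, as suggested below, one does want the raw counts divided by the document length, a small sketch of such a custom step (the column names and the DataFrame called docs are assumptions):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Assumes a DataFrame docs with a string column "text".
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawTF")
val counts = hashingTF.transform(tokenizer.transform(docs))

// Divide each raw count by the number of tokens in the document.
val relativeTF = udf { (v: Vector, words: Seq[String]) =>
  val n = math.max(words.size, 1).toDouble
  v match {
    case s: SparseVector => Vectors.sparse(s.size, s.indices, s.values.map(_ / n))
    case d               => Vectors.dense(d.toArray.map(_ / n))
  }
}
val tf = counts.withColumn("tf", relativeTF($"rawTF", $"words"))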

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang  wrote:

> Hi Hao,
>
> HashingTF directly apply a hash function (Murmurhash3) to the features to
> determine their column index. It excluded any thought about the term
> frequency or the length of the document. It does similar work compared with
> sklearn FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and can
> not convert the output back to the original features. Actually we misnamed
> this transformer, it only does the work of feature hashing rather than
> computing hashing term frequency.
>
> CountVectorizer will select the top vocabSize words ordered by term
> frequency across the corpus to build the hash table of the features. So it
> will consume more memory than HashingTF. However, we can convert the output
> back to the original feature.
>
> Both of the transformers do not consider the length of each document. If
> you want to compute term frequency divided by the length of the document,
> you should write your own function based on transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren :
>
>> When computing term frequency, we can use either HashTF or
>> CountVectorizer feature extractors.
>> However, both of them just use the number of times that a term appears in
>> a document.
>> It is not a true frequency. Acutally, it should be divided by the length
>> of the document.
>>
>> Is this a wanted feature ?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>


Re: java.net.UnknownHostException

2016-08-02 Thread pseudo oduesp
Can someone help me, please?

2016-08-01 11:51 GMT+02:00 pseudo oduesp :

> Hi,
> I get the following errors when I try using PySpark 2.0 with IPython on
> YARN.
> Can someone help me, please?
> java.lang.IllegalArgumentException: java.net.UnknownHostException:
> s001.bigdata.;s003.bigdata;s008bigdata.
> at
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:823)
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:779)
> at
> org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:133)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:130)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:130)
> at
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:367)
> at
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
> at
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
> at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
> at org.apache.spark.SparkContext.(SparkContext.scala:500)
> at
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by:
> java.net.UnknownHostException:s001.bigdata.;s003.bigdata;s008bigdata.
>
>
> thanks
>


Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread ayan guha
Zeppelin already has a param for jdbc
On 2 Aug 2016 19:50, "Mich Talebzadeh"  wrote:

> Ok I have already set up mine
>
> 
> <property>
>   <name>hive.limit.optimize.fetch.max</name>
>   <value>50000</value>
>   <description>
>     Maximum number of rows allowed for a smaller subset of data for
> simple LIMIT, if it is a fetch query.
>     Insert queries are not restricted by this limit.
>   </description>
> </property>
>
> I am surprised that yours was missing. What did you set it up to?
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 10:18, Chanh Le  wrote:
>
>> I tried and it works perfectly.
>>
>> Regards,
>> Chanh
>>
>>
>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh 
>> wrote:
>>
>> OK
>>
>> Try that
>>
>> Another tedious way is to create views in Hive based on tables and use
>> limit on those views.
>>
>> But try that parameter first if it does anything.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 09:13, Chanh Le  wrote:
>>
>>> Hi Mich,
>>> I use Spark Thrift Server basically it acts like Hive.
>>>
>>> I see that there is property in Hive.
>>>
>>> hive.limit.optimize.fetch.max
>>>
>>>- Default Value: 50000
>>>- Added In: Hive 0.8.0
>>>
>>> Maximum number of rows allowed for a smaller subset of data for simple
>>> LIMIT, if it is a fetch query. Insert queries are not restricted by this
>>> limit.
>>>
>>>
>>> Is that related to the problem?
>>>
>>>
>>>
>>>
>>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh 
>>> wrote:
>>>
>>> This is a classic problem on any RDBMS
>>>
>>> Set the limit on the number of rows returned like maximum of 50K rows
>>> through JDBC
>>>
>>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 2 August 2016 at 08:41, Chanh Le  wrote:
>>>
 Hi everyone,
 I setup STS and use Zeppelin to query data through JDBC connection.
 A problem we are facing is that users usually forget to put a limit in the
 query, so it causes the cluster to hang.

 SELECT * FROM tableA;

 Is there anyway to config the limit by default ?


 Regards,
 Chanh
>>>
>>>
>>>
>>>
>>
>>
>


Re: The equivalent for INSTR in Spark FP

2016-08-02 Thread Jacek Laskowski
Congrats! You made it. A serious Spark dev badge unlocked :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 2, 2016 at 9:58 AM, Mich Talebzadeh
 wrote:
> it should be lit(0) :)
>
> rs.select(mySubstr($"transactiondescription", lit(0),
> instr($"transactiondescription", "CD"))).show(1)
> +--+
> |UDF(transactiondescription,0,instr(transactiondescription,CD))|
> +--+
> |  OVERSEAS TRANSACTI C|
> +--+
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 2 August 2016 at 08:52, Mich Talebzadeh 
> wrote:
>>
>> No thinking on my part!!!
>>
>> rs.select(mySubstr($"transactiondescription", lit(1),
>> instr($"transactiondescription", "CD"))).show(2)
>> +--+
>> |UDF(transactiondescription,1,instr(transactiondescription,CD))|
>> +--+
>> |   VERSEAS TRANSACTI C|
>> |   XYZ.COM 80...|
>> +--+
>> only showing top 2 rows
>>
>> Let me test it.
>>
>> Cheers
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> loss, damage or destruction of data or any other property which may arise
>> from relying on this email's technical content is explicitly disclaimed. The
>> author will in no case be liable for any monetary damages arising from such
>> loss, damage or destruction.
>>
>>
>>
>>
>> On 1 August 2016 at 23:43, Mich Talebzadeh 
>> wrote:
>>>
>>> Thanks Jacek.
>>>
>>> It sounds like the issue is the position of the second variable in
>>> substring().
>>>
>>> This works
>>>
>>> scala> val wSpec2 =
>>> Window.partitionBy(substring($"transactiondescription",1,20))
>>> wSpec2: org.apache.spark.sql.expressions.WindowSpec =
>>> org.apache.spark.sql.expressions.WindowSpec@1a4eae2
>>>
>>> Using udf as suggested
>>>
>>> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
>>>  |  s.substring(start, end) }
>>> mySubstr: org.apache.spark.sql.UserDefinedFunction =
>>> UserDefinedFunction(,StringType,List(StringType, IntegerType,
>>> IntegerType))
>>>
>>>
>>> This was throwing error:
>>>
>>> val wSpec2 =
>>> Window.partitionBy(substring("transactiondescription",1,indexOf("transactiondescription",'CD')-2))
>>>
>>>
>>> So I tried using udf
>>>
>>> scala> val wSpec2 =
>>> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
>>> instr('s, "CD")))
>>>  | )
>>> :28: error: value select is not a member of
>>> org.apache.spark.sql.ColumnName
>>>  val wSpec2 =
>>> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
>>> instr('s, "CD")))
>>>
>>> Obviously I am not doing correctly :(
>>>
>>> cheers
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any
>>> loss, damage or destruction of data or any other property which may arise
>>> from relying on this email's technical content is explicitly disclaimed. The
>>> author will in no case be liable for any monetary damages arising from such
>>> loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On 1 August 2016 at 23:02, Jacek Laskowski  wrote:

 Hi,

 Interesting...

 I'm temping to think that substring function should accept the columns
 that hold the numbers for start and end. I'd love hearing people's
 thought on this.

 For now, I'd say you need to define udf to do substring as follows:

 scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
 s.substring(start, end) }
 mySubstr: org.apache.spark.sql.expressions.UserDefinedFunction =
 

issue with coalesce in Spark 2.0.0

2016-08-02 Thread ??????
Hi all.



 
I'm testing on Spark 2.0.0 and found an issue when using coalesce in my 
code.
 
The procedure is simple: do a coalesce on an RDD[String], and this 
happened:
 
   java.lang.NoSuchMethodError: 
org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
 
 
 
 
The environment is Spark 2.0.0 with Scala 2.11.8, while the application jar 
is compiled with Spark 1.6.2 and Scala 2.10.6.

BTW, it works OK when the jar is compiled with Spark 2.0.0 and Scala 2.11.8, 
so I presume it's a compatibility issue?
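
That points at binary compatibility: a jar built against Spark 1.6 / Scala 2.10 is not binary compatible with a Spark 2.0 / Scala 2.11 cluster. A minimal sketch of aligning the build (sbt assumed; the versions are illustrative):

// build.sbt: compile against the same Spark and Scala line that runs on the cluster
scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"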




__

Best regards.

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Mich Talebzadeh
Ok I have already set up mine


<property>
  <name>hive.limit.optimize.fetch.max</name>
  <value>50000</value>
  <description>
    Maximum number of rows allowed for a smaller subset of data for
simple LIMIT, if it is a fetch query.
    Insert queries are not restricted by this limit.
  </description>
</property>

I am surprised that yours was missing. What did you set it up to?







Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 10:18, Chanh Le  wrote:

> I tried and it works perfectly.
>
> Regards,
> Chanh
>
>
> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh 
> wrote:
>
> OK
>
> Try that
>
> Another tedious way is to create views in Hive based on tables and use
> limit on those views.
>
> But try that parameter first if it does anything.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 09:13, Chanh Le  wrote:
>
>> Hi Mich,
>> I use Spark Thrift Server basically it acts like Hive.
>>
>> I see that there is property in Hive.
>>
>> hive.limit.optimize.fetch.max
>>
>>- Default Value: 50000
>>- Added In: Hive 0.8.0
>>
>> Maximum number of rows allowed for a smaller subset of data for simple
>> LIMIT, if it is a fetch query. Insert queries are not restricted by this
>> limit.
>>
>>
>> Is that related to the problem?
>>
>>
>>
>>
>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh 
>> wrote:
>>
>> This is a classic problem on any RDBMS
>>
>> Set the limit on the number of rows returned like maximum of 50K rows
>> through JDBC
>>
>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 08:41, Chanh Le  wrote:
>>
>>> Hi everyone,
>>> I setup STS and use Zeppelin to query data through JDBC connection.
>>> A problem we are facing is that users usually forget to put a limit in the
>>> query, so it causes the cluster to hang.
>>>
>>> SELECT * FROM tableA;
>>>
>>> Is there anyway to config the limit by default ?
>>>
>>>
>>> Regards,
>>> Chanh
>>
>>
>>
>>
>
>


Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?

2016-08-02 Thread dueckm
Hello,

I built a prototype that uses join and groupBy operations via Spark RDD API.
Recently I migrated it to the Dataset API. Now it runs much slower than with
the original RDD implementation. Did I do something wrong here? Or is this
the price I have to pay for the more convienient API?
Is there a known solution to deal with this effect (eg configuration via
"spark.sql.shuffle.partitions" - but how could I determine the correct
value)?
In my prototype I use Java Beans with a lot of attributes. Does this slow
down Spark-operations with Datasets?
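
As a first knob, the spark.sql.shuffle.partitions setting mentioned above can be changed per session; a sketch (the value is only illustrative, and spark is assumed to be an existing SparkSession):

// Default is 200 shuffle partitions; a small join/groupBy often needs far fewer.
spark.conf.set("spark.sql.shuffle.partitions", "16")

// Or set it when the session is created:
// SparkSession.builder().config("spark.sql.shuffle.partitions", "16").getOrCreate()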

Here I have a simple example that shows the difference: 
JoinGroupByTest.zip

  
- I build 2 RDDs and join and group them. Afterwards I count and display the
joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD() )
- When I do the same actions with Datasets it takes approximately 40 times
as long (Method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).  

Thank you very much for your help.
Matthias

PS1: this is a duplicate of
http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27445.html
because the sign-up confirmation process was not completed when posting
this topic.


PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
RDD implementation, jobs 2/3 to Dataset): 
 

 

 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27448.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark 2.0 readStream from a REST API

2016-08-02 Thread Jacek Laskowski
On Tue, Aug 2, 2016 at 10:59 AM, Ayoub Benali
 wrote:
> Why is writeStream needed to consume the data?

You need to start your structured streaming (query) and there's no way
to access it without DataStreamWriter => writeStream's a must.

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter
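
In other words, a structured streaming query only runs once start() is called on the DataStreamWriter; a minimal sketch (the socket source and console sink are just placeholders):

// Nothing executes until start() is called.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .start()

query.awaitTermination()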

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: The equivalent for INSTR in Spark FP

2016-08-02 Thread Mich Talebzadeh
Apologies, it should read Jacek. I confused it with my friend's name, Jared :(

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 11:18, Mich Talebzadeh 
wrote:

> Thanks Jared for your kind words. I don't think I am anywhere near there
> yet :)
>
> In general I subtract one character before getting to "CD". That is the
> way debit from debit cards are marked in a Bank's statement.
>
> I get out of bound error if -->
> select(mySubstr($"transactiondescription",lit(0),instr($"transactiondescription",
> "CD")-1)   fails with the length. So I did
>
> ll_18740868.where($"transactiontype" === "DEB" &&
> $"transactiondescription" > "
> ").select(mySubstr($"transactiondescription",lit(0),instr($"transactiondescription",
> "CD")-1),$"debitamount").collect.foreach(println)
>
> which basically examines if the value of $"transactiondescription" > "  "
> then do the substring
>
> Now are there better options than that, say make UDF handle the error
> when length($"transactiondescription" < 2 or it is null etc and returns
> something to avoid the program crashing?
>
> Thanks again for your help.
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 10:57, Jacek Laskowski  wrote:
>
>> Congrats! You made it. A serious Spark dev badge unlocked :)
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Aug 2, 2016 at 9:58 AM, Mich Talebzadeh
>>  wrote:
>> > it should be lit(0) :)
>> >
>> > rs.select(mySubstr($"transactiondescription", lit(0),
>> > instr($"transactiondescription", "CD"))).show(1)
>> > +--+
>> > |UDF(transactiondescription,0,instr(transactiondescription,CD))|
>> > +--+
>> > |  OVERSEAS TRANSACTI C|
>> > +--+
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly
>> disclaimed. The
>> > author will in no case be liable for any monetary damages arising from
>> such
>> > loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 2 August 2016 at 08:52, Mich Talebzadeh 
>> > wrote:
>> >>
>> >> No thinking on my part!!!
>> >>
>> >> rs.select(mySubstr($"transactiondescription", lit(1),
>> >> instr($"transactiondescription", "CD"))).show(2)
>> >> +--+
>> >> |UDF(transactiondescription,1,instr(transactiondescription,CD))|
>> >> +--+
>> >> |   VERSEAS TRANSACTI C|
>> >> |   XYZ.COM 80...|
>> >> +--+
>> >> only showing top 2 rows
>> >>
>> >> Let me test it.
>> >>
>> >> Cheers
>> >>
>> >>
>> >>
>> >> Dr Mich Talebzadeh
>> >>
>> >>
>> >>
>> >> LinkedIn
>> >>
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >>
>> >>
>> >>
>> >> http://talebzadehmich.wordpress.com
>> >>
>> >>
>> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
>> >> loss, damage or destruction of data or any other property which may
>> arise
>> >> from relying on this email's technical 

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
I just added a param to the Spark Thrift Server start command: --hiveconf 
hive.limit.optimize.fetch.max=1000




> On Aug 2, 2016, at 4:50 PM, Mich Talebzadeh  wrote:
> 
> Ok I have already set up mine
> 
> 
> <property>
>   <name>hive.limit.optimize.fetch.max</name>
>   <value>50000</value>
>   <description>
>     Maximum number of rows allowed for a smaller subset of data for simple
> LIMIT, if it is a fetch query.
>     Insert queries are not restricted by this limit.
>   </description>
> </property>
> 
> I am surprised that yours was missing. What did you set it up to?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 10:18, Chanh Le  > wrote:
> I tried and it works perfectly.
> 
> Regards,
> Chanh
> 
> 
>> On Aug 2, 2016, at 3:33 PM, Mich Talebzadeh > > wrote:
>> 
>> OK
>> 
>> Try that
>> 
>> Another tedious way is to create views in Hive based on tables and use limit 
>> on those views.
>> 
>> But try that parameter first if it does anything.
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 2 August 2016 at 09:13, Chanh Le > > wrote:
>> Hi Mich,
>> I use Spark Thrift Server basically it acts like Hive.
>> 
>> I see that there is property in Hive.
>> 
>>> hive.limit.optimize.fetch.max
>>> Default Value: 50000
>>> Added In: Hive 0.8.0
>>> Maximum number of rows allowed for a smaller subset of data for simple 
>>> LIMIT, if it is a fetch query. Insert queries are not restricted by this 
>>> limit.
>> 
>> Is that related to the problem?
>> 
>> 
>> 
>> 
>>> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh >> > wrote:
>>> 
>>> This is a classic problem on any RDBMS
>>> 
>>> Set the limit on the number of rows returned like maximum of 50K rows 
>>> through JDBC
>>> 
>>> What is your JDBC connection going to? Meaning which RDBMS if any?
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>> 
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> On 2 August 2016 at 08:41, Chanh Le >> > wrote:
>>> Hi everyone,
>>> I setup STS and use Zeppelin to query data through JDBC connection.
>>> A problem we are facing is users usually forget to put limit in the query 
>>> so it causes hang the cluster.
>>> 
>>> SELECT * FROM tableA;
>>> 
>>> Is there anyway to config the limit by default ?
>>> 
>>> 
>>> Regards,
>>> Chanh
>>> 
>> 
>> 
> 
> 



Extracting key word from a textual column

2016-08-02 Thread Mich Talebzadeh
Hi,

Need some ideas.

*Summary:*

I am working on a tool to slice and dice the amount of money I have spent
so far (meaning the whole data sample) on a given retailer so I have a
better idea of where I am wasting the money

*Approach*

Downloaded my bank statements from a given account in csv format from
inception till the end of July. I read the data and stored it in an ORC table.

I am interested in all bills that I paid using a debit card (
transactiontype = "DEB") that come out of the account directly.
Transactiontype is the three-character lookup code that I downloaded as well.

scala> ll_18740868.printSchema
root
 |-- transactiondate: date (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: double (nullable = true)
 |-- creditamount: double (nullable = true)
 |-- balance: double (nullable = true)

The important fields are transactiondate, transactiontype,
transactiondescription and debitamount

So using analytics/windowing I can do all sorts of things. For example,
this one gives me the last time I spent money on retailer XYZ and the amount:

SELECT *
FROM (
  select transactiondate, transactiondescription, debitamount
  , rank() over (order by transactiondate desc) AS rank
  from accounts.ll_18740868 where transactiondescription like '%XYZ%'
 ) tmp
where rank <= 1

And its equivalent using Windowing in FP

import org.apache.spark.sql.expressions.Window
val wSpec =
Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
rank().over(wSpec).as("rank")).filter($"rank"===1).show


+---+--++
|transactiondate|transactiondescription|rank|
+---+--++
| 2015-12-15|  XYZ LTD CD 4636 |   1|
+---+--++

So far so good. But if I want to find everything I spent at each retailer, it
gets trickier, as a retailer appears like below in the
transactiondescription column:

ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
+--+
|transactiondescription|
+--+
|  SAINSBURYS SMKT C...|
|  SACAT SAINSBURYS ...|
|  SAINSBURY'S SMKT ...|
|  SAINSBURYS S/MKT ...|
|  SACAT SAINSBURYS ...|
+--+

If I look at them I know they all belong to SAINSBURY'S (a food retailer). I
have done some crude grouping and it works, somewhat:

//define UDF here to handle substring
val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
s.substring(start, end) }
var cutoff = "CD"  // currently used in the statement
val wSpec2 =
Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
cutoff)-1))
ll_18740868.where($"transactiontype" === "DEB" &&
($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
cutoff)-1).as("Retailer"),sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)

However, I really need to extract the "keyword" retailer name from the
transactiondescription column, and I need some ideas about the best way of
doing it. Is this possible in Spark?

Thanks
Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API? [*]

2016-08-02 Thread dueckm


Hello,

I built a prototype that uses join and groupBy operations via the Spark RDD
API. Recently I migrated it to the Dataset API. Now it runs much slower
than with the original RDD implementation. Did I do something wrong here?
Or is this the price I have to pay for the more convenient API?
Is there a known solution to deal with this effect (e.g. configuration via
"spark.sql.shuffle.partitions" - but how could I determine the correct
value)?
In my prototype I use Java Beans with a lot of attributes. Does this slow
down Spark operations with Datasets?
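
Regarding the spark.sql.shuffle.partitions question above, a hedged sketch of how
the setting could be experimented with (Scala shown; the value 32 is an arbitrary
example, not a recommendation, and the right number depends on data volume and
cluster size):

// sketch: lower the shuffle partition count (default 200 in Spark 2.0) before the Dataset join/groupBy
spark.conf.set("spark.sql.shuffle.partitions", "32")
// then rerun the Dataset variant and compare the stage times in the Spark UI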

Here I have a simple example that shows the difference (see attached
file: JoinGroupByTest.zip):
- I build 2 RDDs and join and group them. Afterwards I count and display
the joined RDDs.  (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD
() )
- When I do the same actions with Datasets it takes approximately 40 times
as long (Method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).

Thank you very much for your help.
Matthias

PS: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
RDD implementation, jobs 2/3 to Dataset):






Fiducia & GAD IT AG | www.fiduciagad.de
AG Frankfurt a. M. HRB 102381 | Sitz der Gesellschaft: Hahnstr. 48, 60528
Frankfurt a. M. | USt-IdNr. DE 143582320
Vorstand: Klaus-Peter Bruns (Vorsitzender), Claus-Dieter Toben (stv.
Vorsitzender),
Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten
Pfläging, Jörg Staff
Vorsitzender des Aufsichtsrats: Jürgen Brinkmann

2D782357.gif (62K) 

2D546574.gif (98K) 

2D310440.gif (126K) 

JoinGroupByTest.zip (5K) 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27449.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

unsubscribe

2016-08-02 Thread doovsaid
unsubscribe




ZhangYi (张逸)
BigEye 
website: http://www.bigeyedata.com
blog: http://zhangyi.farbox.com
tel: 15023157626




- Original Message -
From: "zhangjp" <592426...@qq.com>
To: "user" 
Subject: unsubscribe
Date: 2016-08-02 11:00

unsubscribe

describe function limit of columns

2016-08-02 Thread pseudo oduesp
Hi,

In Spark 1.5.0 I used the describe function with more than 100 columns.
Can someone tell me if any limit exists now?

Thanks


Re: In 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?

2016-08-02 Thread Jacek Laskowski
Hi,

Don't think so.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Aug 2, 2016 at 10:25 PM, pgb  wrote:
> I'm interested in learning if it's possible to grab the results set from a
> query run on an external database as opposed to grabbing the full table and
> manipulating it later. The base code I'm working with is below (using Spark
> 2.0.0):
>
> ```
> from pyspark.sql import SparkSession
>
> df = spark.read\
> .format("jdbc")\
> .option("url", "jdbc:mysql://localhost:port")\
> .option("dbtable", "schema.tablename")\
> .option("user", "username")\
> .option("password", "password")\
> .load()
> ```
>
> Many thanks,
> Pat
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/In-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database-rather-than-grab-the-whole-table-tp27453.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Sun Rui
import org.apache.spark.sql.catalyst.encoders.RowEncoder
implicit val encoder = RowEncoder(df.schema)
df.mapPartitions(_.take(1))
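
Putting that together with the original snippet, a self-contained sketch (assuming
a Spark 2.0 spark-shell session named spark):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import spark.implicits._

val df: DataFrame = Seq((1, "one"), (2, "two")).toDF("id", "name")
// RowEncoder builds an ExpressionEncoder[Row] from the schema; making it implicit
// lets mapPartitions on a DataFrame (i.e. Dataset[Row]) compile.
implicit val rowEncoder = RowEncoder(df.schema)
val firstPerPartition = df.mapPartitions(_.take(1))
firstPerPartition.show()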

> On Aug 3, 2016, at 04:55, Dragisa Krsmanovic  wrote:
> 
> I am trying to use mapPartitions on DataFrame.
> 
> Example:
> 
> import spark.implicits._
> val df: DataFrame = Seq((1,"one"), (2, "two")).toDF("id", "name")
> df.mapPartitions(_.take(1))
> 
> I am getting:
> 
> Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
> String, etc) and Product types (case classes) are supported by importing 
> spark.implicits._  Support for serializing other types will be added in 
> future releases.
> 
> Since DataFrame is Dataset[Row], I was expecting encoder for Row to be there.
> 
> What's wrong with my code ?
>  
> 
> -- 
> Dragiša Krsmanović | Platform Engineer | Ticketfly
> 
> dragi...@ticketfly.com 
> @ticketfly  | ticketfly.com/blog 
>  | facebook.com/ticketfly 
> 


FW: Stop Spark Streaming Jobs

2016-08-02 Thread Park Kyeong Hee
So sorry. Your name was Pradeep !!

-Original Message-
From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com] 
Sent: Wednesday, August 03, 2016 11:24 AM
To: 'Pradeep'; 'user@spark.apache.org'
Subject: RE: Stop Spark Streaming Jobs

Hi. Paradeep


Did you mean, how to kill the job?
If yes, you should kill the driver and follow next.

on yarn-client
1. find pid - "ps -es | grep "
2. kill it - "kill -9 "
3. check executors were down - "yarn application -list"

on yarn-cluster
1. find driver's application ID - "yarn application -list"
2. stop it - "yarn application -kill "
3. check driver and executors were down - "yarn application -list"
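
As a hedged aside, if the driver code itself can be modified, Spark Streaming also
offers a graceful in-process stop, which lets in-flight batches finish instead of
killing the driver mid-batch (sketch only; ssc stands for the application's
StreamingContext):

// stop the StreamingContext and the underlying SparkContext after pending batches complete
ssc.stop(stopSparkContext = true, stopGracefully = true)
// alternatively, spark.streaming.stopGracefullyOnShutdown=true makes a JVM shutdown hook do the same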


Thanks.

-Original Message-
From: Pradeep [mailto:pradeep.mi...@mail.com] 
Sent: Wednesday, August 03, 2016 10:48 AM
To: user@spark.apache.org
Subject: Stop Spark Streaming Jobs

Hi All,

My streaming job reads data from Kafka. The job is triggered and pushed to
background with nohup.

What are the recommended ways to stop job either on yarn-client or cluster
mode.

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Extracting key word from a textual column

2016-08-02 Thread ayan guha
I would stay away from transactional tables until they are fully baked. I do
not see why you need to update rather than keep inserting with a timestamp and
derive the latest value on the fly while joining.

But I guess it has become a religious question now :) and I am not
unbiased.
On 3 Aug 2016 08:51, "Mich Talebzadeh"  wrote:

> There are many ways of addressing this issue.
>
> Using Hbase with Phoenix adds another layer to the stack which is not
> necessary for handful of table and will add to cost (someone else has to
> know about Hbase, Phoenix etc. (BTW I would rather work directly on Hbase
> table. It is faster)
>
> There may be say 100 new entries into this catalog table with multiple
> updates (not a single DML) to get hashtag right. sometimes it is an
> iterative process which results in many deltas.
>
> If that is needed done once a day or on demand, an alternative would be to
> insert overwrite the transactional hive table with deltas into a text table
> in Hive and present that one to Spark. This allows Spark to see the data.
>
> Remember if I use Hive to do the analytics/windowing, there is no issue.
> The issue is with Spark that neither Spark SQL or Spark shell can use that
> table.
>
> Sounds like an issue for Spark to resolve later.
>
> Another alternative one can leave the transactional table in RDBMS for
> this purpose and load it into DF through JDBC interface. It works fine and
> pretty fast.
>
> Again these are all workarounds. I discussed this in Hive forum. There
> should be a way" to manually compact a transactional table in Hive" (not
> possible now) and second point if Hive can see the data in Hive table, why
> not Spark?
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 23:10, Ted Yu  wrote:
>
>> +1
>>
>> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
>>
>> If you need to use single inserts, updates, deletes, select why not use
>> hbase with Phoenix? I see it as complementary to the hive / warehouse
>> offering
>>
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I decided to create a catalog table in Hive ORC and transactional. That
>> table has two columns of value
>>
>>
>>1. transactiondescription === account_table.transactiondescription
>>2. hashtag String column created from a semi automated process of
>>deriving it from account_table.transactiondescription
>>
>> Once the process is complete in populating the catalog table then we just
>> need to create a new DF based on join between catalog table and the
>> account_table. The join will use hashtag in catalog table to loop over
>> debit column in account_table for a given hashtag. That is pretty fast as
>> going through pattern matching is pretty intensive in any application and
>> database in real time.
>>
>> So one can build up the catalog table over time as a reference table. I
>> am sure such tables exist in commercial world.
>>
>> Anyway after getting results out I know how I am wasting my money on
>> different things, especially on clothing  etc :)
>>
>>
>> HTH
>>
>> P.S. Also there is an issue with Spark not being able to read data
>> through Hive transactional tables that have not been compacted yet. Spark
>> just crashes. If these tables need to be updated regularly say catalog
>> table and they are pretty small, one might maintain them in an RDBMS and
>> read them once through JDBC into a DataFrame in Spark before doing
>> analytics.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>>
>>> Hi Mich,
>>>
>>> It seems like an entity resolution problem - looking at different
>>> representations of an entity - SAINSBURY in this case and matching them all
>>> together. How dirty is your 

Re: Extracting key word from a textual column

2016-08-02 Thread Ted Yu
+1

> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
> 
> If you need to use single inserts, updates, deletes, select why not use hbase 
> with Phoenix? I see it as complementary to the hive / warehouse offering 
> 
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh  wrote:
>> 
>> Hi,
>> 
>> I decided to create a catalog table in Hive ORC and transactional. That 
>> table has two columns of value
>> 
>> transactiondescription === account_table.transactiondescription
>> hashtag String column created from a semi automated process of deriving it 
>> from account_table.transactiondescription
>> Once the process is complete in populating the catalog table then we just 
>> need to create a new DF based on join between catalog table and the 
>> account_table. The join will use hashtag in catalog table to loop over debit 
>> column in account_table for a given hashtag. That is pretty fast as going 
>> through pattern matching is pretty intensive in any application and database 
>> in real time.
>> 
>> So one can build up the catalog table over time as a reference table. I am 
>> sure such tables exist in commercial world.
>> 
>> Anyway after getting results out I know how I am wasting my money on 
>> different things, especially on clothing  etc :)
>> 
>> 
>> HTH
>> 
>> P.S. Also there is an issue with Spark not being able to read data through 
>> Hive transactional tables that have not been compacted yet. Spark just 
>> crashes. If these tables need to be updated regularly say catalog table and 
>> they are pretty small, one might maintain them in an RDBMS and read them 
>> once through JDBC into a DataFrame in Spark before doing analytics.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>>> Hi Mich,
>>> 
>>> It seems like an entity resolution problem - looking at different 
>>> representations of an entity - SAINSBURY in this case and matching them all 
>>> together. How dirty is your data in the description - are there stop words 
>>> like SACAT/SMKT etc you can strip off and get the base retailer entity ?
>>> 
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies 
>>> Reifier at Strata Hadoop World
>>> Reifier at Spark Summit 2015
>>> 
>>> 
>>> 
>>> 
>>> 
 On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
  wrote:
 Thanks.
 
 I believe there is some catalog of companies that I can get and store it 
 in a table and math the company name to transactiondesciption column.
 
 That catalog should have sectors in it. For example company XYZ is under 
 Grocers etc which will make search and grouping much easier.
 
 I believe Spark can do it, though I am generally interested on alternative 
 ideas.
 
 
 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
 
 Disclaimer: Use it at your own risk. Any and all responsibility for any 
 loss, damage or destruction of data or any other property which may arise 
 from relying on this email's technical content is explicitly disclaimed. 
 The author will in no case be liable for any monetary damages arising from 
 such loss, damage or destruction.
  
 
> On 2 August 2016 at 16:26, Yong Zhang  wrote:
> Well, if you still want to use windows function for your logic, then you 
> need to derive a new column out, like "catalog", and use it as part of 
> grouping logic.
> 
> 
> Maybe you can use regex for deriving out this new column. The 
> implementation needs to depend on your data in "transactiondescription", 
> and regex gives you the most powerful way to handle your data.
> 
> 
> This is really not a Spark question, but how to you process your logic 
> based on the data given.
> 
> 
> Yong
> 
> 
> From: Mich Talebzadeh 
> Sent: Tuesday, August 2, 2016 10:00 AM
> To: user @spark
> Subject: Extracting key word from a textual column
>  
> Hi,
> 
> Need some ideas.
> 
> Summary:
> 
> I am working on a tool to slice and dice the amount of money I have spent 
> so far (meaning the whole data sample) on a given retailer so I have a 
> better idea of where I 

Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-02 Thread Utkarsh Sengar
Upgraded to Spark 2.0 and tried to load a model:
LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
"s3a://cruncher/c/models/lr/");

Getting this error: Exception in thread "main"
java.lang.IllegalArgumentException: Wrong FS: file://spark-warehouse,
expected: file:///
Full stacktrace:
https://gist.githubusercontent.com/utkarsh2012/7c4c8e0f408e36a8fb6d9c9d3bd6b301/raw/2621ed3ceffb63d72ecdce169193dfabe4d41b40/spark2.0%2520LR%2520load


This was working fine in Spark 1.5.1. I don't have "spark-warehouse"
anywhere in my code, so it's somehow defaulting to that.

-- 
Thanks,
-Utkarsh


Re: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-02 Thread Sean Owen
This is https://issues.apache.org/jira/browse/SPARK-15899  -- anyone
seeing this please review the proposed change. I think it's stalled
and needs an update.

On Tue, Aug 2, 2016 at 4:47 PM, Utkarsh Sengar  wrote:
> Upgraded to spark2.0 and tried to load a model:
> LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
> "s3a://cruncher/c/models/lr/");
>
> Getting this error: Exception in thread "main"
> java.lang.IllegalArgumentException: Wrong FS: file://spark-warehouse,
> expected: file:///
> Full stacktrace:
> https://gist.githubusercontent.com/utkarsh2012/7c4c8e0f408e36a8fb6d9c9d3bd6b301/raw/2621ed3ceffb63d72ecdce169193dfabe4d41b40/spark2.0%2520LR%2520load
>
>
> This was working fine in Spark 1.5.1. I don't have "spark-warehouse"
> anywhere in my code, so its somehow defaulting to that.
>
> --
> Thanks,
> -Utkarsh

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-02 Thread Utkarsh Sengar
I don't think it's a related problem, although setting
"spark.sql.warehouse.dir"=/tmp in the Spark config fixed it.

On Tue, Aug 2, 2016 at 5:02 PM, Utkarsh Sengar 
wrote:

> Do we have a workaround for this problem?
> Can I overwrite that using some config?
>
> On Tue, Aug 2, 2016 at 4:48 PM, Sean Owen  wrote:
>
>> This is https://issues.apache.org/jira/browse/SPARK-15899  -- anyone
>> seeing this please review the proposed change. I think it's stalled
>> and needs an update.
>>
>> On Tue, Aug 2, 2016 at 4:47 PM, Utkarsh Sengar 
>> wrote:
>> > Upgraded to spark2.0 and tried to load a model:
>> > LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
>> > "s3a://cruncher/c/models/lr/");
>> >
>> > Getting this error: Exception in thread "main"
>> > java.lang.IllegalArgumentException: Wrong FS: file://spark-warehouse,
>> > expected: file:///
>> > Full stacktrace:
>> >
>> https://gist.githubusercontent.com/utkarsh2012/7c4c8e0f408e36a8fb6d9c9d3bd6b301/raw/2621ed3ceffb63d72ecdce169193dfabe4d41b40/spark2.0%2520LR%2520load
>> >
>> >
>> > This was working fine in Spark 1.5.1. I don't have "spark-warehouse"
>> > anywhere in my code, so its somehow defaulting to that.
>> >
>> > --
>> > Thanks,
>> > -Utkarsh
>>
>
>
>
> --
> Thanks,
> -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Ted Yu
Using spark-shell of master branch:

scala> case class Entry(id: Integer, name: String)
defined class Entry

scala> val df  = Seq((1,"one"), (2, "two")).toDF("id", "name").as[Entry]
16/08/02 16:47:01 DEBUG package$ExpressionCanonicalizer:
=== Result of Batch CleanExpressions ===
!assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)._1 AS _1#10   assertnotnull(input[0, scala.Tuple2, true], top level
non-flat input object)._1
!+- assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)._1 +- assertnotnull(input[0, scala.Tuple2, true], top level
non-flat input object)
!   +- assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)+- input[0, scala.Tuple2, true]
!  +- input[0, scala.Tuple2, true]
...

scala> df.mapPartitions(_.take(1))

On Tue, Aug 2, 2016 at 1:55 PM, Dragisa Krsmanovic 
wrote:

> I am trying to use mapPartitions on DataFrame.
>
> Example:
>
> import spark.implicits._
> val df: DataFrame = Seq((1,"one"), (2, "two")).toDF("id", "name")
> df.mapPartitions(_.take(1))
>
> I am getting:
>
> Unable to find encoder for type stored in a Dataset.  Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._  Support for serializing other types will be
> added in future releases.
>
> Since DataFrame is Dataset[Row], I was expecting encoder for Row to be
> there.
>
> What's wrong with my code ?
>
>
> --
>
> Dragiša Krsmanović | Platform Engineer | Ticketfly
>
> dragi...@ticketfly.com
>
> @ticketfly  | ticketfly.com/blog |
> facebook.com/ticketfly
>


Re: Spark 2.0 error: Wrong FS: file://spark-warehouse, expected: file:///

2016-08-02 Thread Utkarsh Sengar
Do we have a workaround for this problem?
Can I overwrite that using some config?

On Tue, Aug 2, 2016 at 4:48 PM, Sean Owen  wrote:

> This is https://issues.apache.org/jira/browse/SPARK-15899  -- anyone
> seeing this please review the proposed change. I think it's stalled
> and needs an update.
>
> On Tue, Aug 2, 2016 at 4:47 PM, Utkarsh Sengar 
> wrote:
> > Upgraded to spark2.0 and tried to load a model:
> > LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
> > "s3a://cruncher/c/models/lr/");
> >
> > Getting this error: Exception in thread "main"
> > java.lang.IllegalArgumentException: Wrong FS: file://spark-warehouse,
> > expected: file:///
> > Full stacktrace:
> >
> https://gist.githubusercontent.com/utkarsh2012/7c4c8e0f408e36a8fb6d9c9d3bd6b301/raw/2621ed3ceffb63d72ecdce169193dfabe4d41b40/spark2.0%2520LR%2520load
> >
> >
> > This was working fine in Spark 1.5.1. I don't have "spark-warehouse"
> > anywhere in my code, so its somehow defaulting to that.
> >
> > --
> > Thanks,
> > -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Dragisa Krsmanovic
You are converting DataFrame to Dataset[Entry].

DataFrame is Dataset[Row].

mapPartitions works fine with a simple typed Dataset, just not with a DataFrame.



On Tue, Aug 2, 2016 at 4:50 PM, Ted Yu  wrote:

> Using spark-shell of master branch:
>
> scala> case class Entry(id: Integer, name: String)
> defined class Entry
>
> scala> val df  = Seq((1,"one"), (2, "two")).toDF("id", "name").as[Entry]
> 16/08/02 16:47:01 DEBUG package$ExpressionCanonicalizer:
> === Result of Batch CleanExpressions ===
> !assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
> object)._1 AS _1#10   assertnotnull(input[0, scala.Tuple2, true], top level
> non-flat input object)._1
> !+- assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
> object)._1 +- assertnotnull(input[0, scala.Tuple2, true], top level
> non-flat input object)
> !   +- assertnotnull(input[0, scala.Tuple2, true], top level non-flat
> input object)+- input[0, scala.Tuple2, true]
> !  +- input[0, scala.Tuple2, true]
> ...
>
> scala> df.mapPartitions(_.take(1))
>
> On Tue, Aug 2, 2016 at 1:55 PM, Dragisa Krsmanovic  > wrote:
>
>> I am trying to use mapPartitions on DataFrame.
>>
>> Example:
>>
>> import spark.implicits._
>> val df: DataFrame = Seq((1,"one"), (2, "two")).toDF("id", "name")
>> df.mapPartitions(_.take(1))
>>
>> I am getting:
>>
>> Unable to find encoder for type stored in a Dataset.  Primitive types
>> (Int, String, etc) and Product types (case classes) are supported by
>> importing spark.implicits._  Support for serializing other types will be
>> added in future releases.
>>
>> Since DataFrame is Dataset[Row], I was expecting encoder for Row to be
>> there.
>>
>> What's wrong with my code ?
>>
>>
>> --
>>
>> Dragiša Krsmanović | Platform Engineer | Ticketfly
>>
>> dragi...@ticketfly.com
>>
>> @ticketfly  | ticketfly.com/blog |
>> facebook.com/ticketfly
>>
>
>


-- 

Dragiša Krsmanović | Platform Engineer | Ticketfly

dragi...@ticketfly.com

@ticketfly  | ticketfly.com/blog |
facebook.com/ticketfly


Stop Spark Streaming Jobs

2016-08-02 Thread Pradeep
Hi All,

My streaming job reads data from Kafka. The job is triggered and pushed to 
background with nohup.

What are the recommended ways to stop job either on yarn-client or cluster mode.

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Extracting key word from a textual column

2016-08-02 Thread Yong Zhang
Well, if you still want to use window functions for your logic, then you need
to derive a new column, like "catalog", and use it as part of the grouping
logic.


Maybe you can use a regex to derive this new column. The implementation
depends on your data in "transactiondescription", and regex gives you
the most powerful way to handle it.
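
For illustration only, a sketch of such a derived column using regexp_extract,
reusing the column names and the assumption from this thread that the retailer
name precedes the "CD" marker (the pattern would need adjusting to the real data):

import org.apache.spark.sql.functions.{regexp_extract, sum, trim}

// sketch: derive a "retailer" column and aggregate debits per retailer
val withRetailer = ll_18740868
  .where($"transactiontype" === "DEB" && $"transactiondescription".isNotNull)
  .withColumn("retailer", trim(regexp_extract($"transactiondescription", "^(.*?)\\s+CD\\b", 1)))

withRetailer.groupBy("retailer")
  .agg(sum($"debitamount").as("spent"))
  .orderBy($"spent".desc)
  .show(false)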


This is really not a Spark question, but about how you process your logic based on
the data given.


Yong



From: Mich Talebzadeh 
Sent: Tuesday, August 2, 2016 10:00 AM
To: user @spark
Subject: Extracting key word from a textual column

Hi,

Need some ideas.

Summary:

I am working on a tool to slice and dice the amount of money I have spent so 
far (meaning the whole data sample) on a given retailer so I have a better idea 
of where I am wasting the money

Approach

Downloaded my bank statements from a given account in csv format from inception 
till end of July. Read the data and stored it in ORC table.

I am interested for all bills that I paid using Debit Card ( transactiontype = 
"DEB") that comes out the account directly. Transactiontype is the three 
character code lookup that I download as well.

scala> ll_18740868.printSchema
root
 |-- transactiondate: date (nullable = true)
 |-- transactiontype: string (nullable = true)
 |-- sortcode: string (nullable = true)
 |-- accountnumber: string (nullable = true)
 |-- transactiondescription: string (nullable = true)
 |-- debitamount: double (nullable = true)
 |-- creditamount: double (nullable = true)
 |-- balance: double (nullable = true)

The important fields are transactiondate, transactiontype, 
transactiondescription and debitamount

So using analytics. windowing I can do all sorts of things. For example this 
one gives me the last time I spent money on retailer XYZ and the amount

SELECT *
FROM (
  select transactiondate, transactiondescription, debitamount
  , rank() over (order by transactiondate desc) AS rank
  from accounts.ll_18740868 where transactiondescription like '%XYZ%'
 ) tmp
where rank <= 1

And its equivalent using Windowing in FP

import org.apache.spark.sql.expressions.Window
val wSpec = 
Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
 rank().over(wSpec).as("rank")).filter($"rank"===1).show


+---+--++
|transactiondate|transactiondescription|rank|
+---+--++
| 2015-12-15|  XYZ LTD CD 4636 |   1|
+---+--++

So far so good. But if I want to find all I spent on each retailer, then it 
gets trickier as a retailer appears like below in the column 
transactiondescription:

ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
+--+
|transactiondescription|
+--+
|  SAINSBURYS SMKT C...|
|  SACAT SAINSBURYS ...|
|  SAINSBURY'S SMKT ...|
|  SAINSBURYS S/MKT ...|
|  SACAT SAINSBURYS ...|
+--+

If I look at them I know they all belong to SAINBURYS (food retailer). I have 
done some crude grouping and it works somehow

//define UDF here to handle substring
val SubstrUDF = udf { (s: String, start: Int, end: Int) => s.substring(start, 
end) }
var cutoff = "CD"  // currently used in the statement
val wSpec2 = 
Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
 cutoff)-1))
ll_18740868.where($"transactiontype" === "DEB" && 
($"transactiondescription").isNotNull).select(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
 
cutoff)-1).as("Retailer"),sum($"debitamount").over(wSpec2).as("Spent")).distinct.orderBy($"Spent").collect.foreach(println)

However, I really need to extract the "keyword" retailer name from 
transactiondescription column And I need some ideas about the best way of doing 
it. Is this possible in Spark?


Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




Fwd: Saving input schema along with PipelineModel

2016-08-02 Thread Satya Narayan1
Hi All,

Is there any way I can save Input schema along with ml PipelineModel object?
This feature will be really helpful while loading the model and running
transform, as user can get back the schema , prepare the dataset for
model.transform and don't need to remember it.

I see below jira talks about this as one of the update, but I am not able
to get any sub-task for the same(also it is marked as resolved).
https://issues.apache.org/jira/browse/SPARK-6725


"*UPDATE*: In spark.ml, we could save feature metadata using DataFrames.
Other libraries and formats can support this, and it would be great if we
could too. We could do either of the following:

   - save() optionally takes a dataset (or schema), and load will return a
   (model, schema) pair.
   - Models themselves save the input schema.

Both options would mean inheriting from new Saveable, Loadable types."

Please let me know if any update or jira on this.


Thanks,
Satya
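
Until something like that is available, a hedged workaround sketch: persist the
input schema's JSON next to the saved PipelineModel and rebuild it on load (the
DataFrame name and file location are assumptions):

import org.apache.spark.sql.types.{DataType, StructType}

// StructType serialises to JSON, so the fitting input's schema can be kept beside the model
val schemaJson: String = trainingDF.schema.json
// ... write schemaJson to, e.g., modelPath + "/input-schema.json" with any file API ...

// on load, rebuild the StructType from the stored JSON
val restored: StructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]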




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Fwd-Saving-input-schema-along-with-PipelineModel-tp27450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Saving input schema along with PipelineModel

2016-08-02 Thread Satyanarayan Patel
Hi All,

Is there any way I can save Input schema along with ml PipelineModel object?
This feature will be really helpful while loading the model and running
transform, as user can get back the schema , prepare the dataset for
model.transform and don't need to remember it.

I see below jira talks about this as one of the update, but I am not able
to get any sub-task for the same(also it is marked as resolved).
https://issues.apache.org/jira/browse/SPARK-6725


"*UPDATE*: In spark.ml, we could save feature metadata using DataFrames.
Other libraries and formats can support this, and it would be great if we
could too. We could do either of the following:

   - save() optionally takes a dataset (or schema), and load will return a
   (model, schema) pair.
   - Models themselves save the input schema.

Both options would mean inheriting from new Saveable, Loadable types."

Please let me know if any update or jira on this.


Thanks,
Satya


Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke
I agree with you.

> On 03 Aug 2016, at 01:20, ayan guha  wrote:
> 
> I would stay away from transaction tables until they are fully baked. I do 
> not see why you need to update vs keep inserting with timestamp and while 
> joining derive latest value on the fly.
> 
> But I guess it has became a religious question now :) and I am not unbiased.
> 
>> On 3 Aug 2016 08:51, "Mich Talebzadeh"  wrote:
>> There are many ways of addressing this issue.
>> 
>> Using Hbase with Phoenix adds another layer to the stack which is not 
>> necessary for handful of table and will add to cost (someone else has to 
>> know about Hbase, Phoenix etc. (BTW I would rather work directly on Hbase 
>> table. It is faster)
>> 
>> There may be say 100 new entries into this catalog table with multiple 
>> updates (not a single DML) to get hashtag right. sometimes it is an 
>> iterative process which results in many deltas.
>> 
>> If that is needed done once a day or on demand, an alternative would be to 
>> insert overwrite the transactional hive table with deltas into a text table 
>> in Hive and present that one to Spark. This allows Spark to see the data.
>> 
>> Remember if I use Hive to do the analytics/windowing, there is no issue. The 
>> issue is with Spark that neither Spark SQL or Spark shell can use that table.
>> 
>> Sounds like an issue for Spark to resolve later.
>> 
>> Another alternative one can leave the transactional table in RDBMS for this 
>> purpose and load it into DF through JDBC interface. It works fine and pretty 
>> fast.
>> 
>> Again these are all workarounds. I discussed this in Hive forum. There 
>> should be a way" to manually compact a transactional table in Hive" (not 
>> possible now) and second point if Hive can see the data in Hive table, why 
>> not Spark?
>> 
>> HTH
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 23:10, Ted Yu  wrote:
>>> +1
>>> 
 On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
 
 If you need to use single inserts, updates, deletes, select why not use 
 hbase with Phoenix? I see it as complementary to the hive / warehouse 
 offering 
 
> On 02 Aug 2016, at 22:34, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I decided to create a catalog table in Hive ORC and transactional. That 
> table has two columns of value
> 
> transactiondescription === account_table.transactiondescription
> hashtag String column created from a semi automated process of deriving 
> it from account_table.transactiondescription
> Once the process is complete in populating the catalog table then we just 
> need to create a new DF based on join between catalog table and the 
> account_table. The join will use hashtag in catalog table to loop over 
> debit column in account_table for a given hashtag. That is pretty fast as 
> going through pattern matching is pretty intensive in any application and 
> database in real time.
> 
> So one can build up the catalog table over time as a reference table. I 
> am sure such tables exist in commercial world.
> 
> Anyway after getting results out I know how I am wasting my money on 
> different things, especially on clothing  etc :)
> 
> 
> HTH
> 
> P.S. Also there is an issue with Spark not being able to read data 
> through Hive transactional tables that have not been compacted yet. Spark 
> just crashes. If these tables need to be updated regularly say catalog 
> table and they are pretty small, one might maintain them in an RDBMS and 
> read them once through JDBC into a DataFrame in Spark before doing 
> analytics.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any 
> loss, damage or destruction of data or any other property which may arise 
> from relying on this email's technical content is explicitly disclaimed. 
> The author will in no case be liable for any monetary damages arising 
> from such loss, damage or destruction.
>  
> 
>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>> Hi Mich,
>> 
>> It 

Re: Extracting key word from a textual column

2016-08-02 Thread Jörn Franke
Phoenix will become another standard query interface to HBase. I do not agree 
that using HBase directly will lead to faster performance; it always depends on 
how you use it. While it is another component, it can make sense to use it. 
This has to be evaluated on a case-by-case basis. 
If you only want to use Hive, you can do the manual compaction as described in 
another answer to this thread.

I recommend that most people use a Hadoop distribution where all of 
these components are properly integrated and you get support.

> On 03 Aug 2016, at 00:51, Mich Talebzadeh  wrote:
> 
> There are many ways of addressing this issue.
> 
> Using Hbase with Phoenix adds another layer to the stack which is not 
> necessary for handful of table and will add to cost (someone else has to know 
> about Hbase, Phoenix etc. (BTW I would rather work directly on Hbase table. 
> It is faster)
> 
> There may be say 100 new entries into this catalog table with multiple 
> updates (not a single DML) to get hashtag right. sometimes it is an iterative 
> process which results in many deltas.
> 
> If that is needed done once a day or on demand, an alternative would be to 
> insert overwrite the transactional hive table with deltas into a text table 
> in Hive and present that one to Spark. This allows Spark to see the data.
> 
> Remember if I use Hive to do the analytics/windowing, there is no issue. The 
> issue is with Spark that neither Spark SQL or Spark shell can use that table.
> 
> Sounds like an issue for Spark to resolve later.
> 
> Another alternative one can leave the transactional table in RDBMS for this 
> purpose and load it into DF through JDBC interface. It works fine and pretty 
> fast.
> 
> Again these are all workarounds. I discussed this in Hive forum. There should 
> be a way" to manually compact a transactional table in Hive" (not possible 
> now) and second point if Hive can see the data in Hive table, why not Spark?
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 2 August 2016 at 23:10, Ted Yu  wrote:
>> +1
>> 
>>> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
>>> 
>>> If you need to use single inserts, updates, deletes, select why not use 
>>> hbase with Phoenix? I see it as complementary to the hive / warehouse 
>>> offering 
>>> 
 On 02 Aug 2016, at 22:34, Mich Talebzadeh  
 wrote:
 
 Hi,
 
 I decided to create a catalog table in Hive ORC and transactional. That 
 table has two columns of value
 
 transactiondescription === account_table.transactiondescription
 hashtag String column created from a semi automated process of deriving it 
 from account_table.transactiondescription
 Once the process is complete in populating the catalog table then we just 
 need to create a new DF based on join between catalog table and the 
 account_table. The join will use hashtag in catalog table to loop over 
 debit column in account_table for a given hashtag. That is pretty fast as 
 going through pattern matching is pretty intensive in any application and 
 database in real time.
 
 So one can build up the catalog table over time as a reference table. I am 
 sure such tables exist in commercial world.
 
 Anyway after getting results out I know how I am wasting my money on 
 different things, especially on clothing  etc :)
 
 
 HTH
 
 P.S. Also there is an issue with Spark not being able to read data through 
 Hive transactional tables that have not been compacted yet. Spark just 
 crashes. If these tables need to be updated regularly say catalog table 
 and they are pretty small, one might maintain them in an RDBMS and read 
 them once through JDBC into a DataFrame in Spark before doing analytics.
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
 
 Disclaimer: Use it at your own risk. Any and all responsibility for any 
 loss, damage or destruction of data or any other property which may arise 
 from relying on this email's technical content is explicitly disclaimed. 
 The author will in no case be liable for any monetary damages arising from 
 such loss, damage or destruction.
  
 
> On 2 August 2016 at 17:56, Sonal Goyal 

Spark on yarn, only 1 or 2 vcores getting allocated to the containers getting created.

2016-08-02 Thread satyajit vegesna
Hi All,

I am trying to run a Spark job using YARN, and I specify an --executor-cores
value of 20.
But when I check the "Nodes of the cluster" page at
http://hostname:8088/cluster/nodes, I see 4 containers getting created
on each node in the cluster.

However, I can only see 1 vcore assigned to each container, even when I
specify --executor-cores 20 while submitting the job using spark-submit.

yarn-site.xml

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>6</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>40</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>7</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>20</value>
</property>



Did anyone face the same issue??

Regards,
Satyajit.


Re: calling dataset.show on a custom object - displays toString() value as first column and blank for rest

2016-08-02 Thread Jacek Laskowski
On Sun, Jul 31, 2016 at 4:16 PM, Rohit Chaddha
 wrote:
> I have a custom object called A and corresponding Dataset
>
> when I call datasetA.show() method i get the following

How do you create datasetA? How does A look like?

Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



RE: Stop Spark Streaming Jobs

2016-08-02 Thread Park Kyeong Hee
Hi. Paradeep


Did you mean, how to kill the job?
If yes, you should kill the driver and follow next.

on yarn-client
1. find pid - "ps -es | grep "
2. kill it - "kill -9 "
3. check executors were down - "yarn application -list"

on yarn-cluster
1. find driver's application ID - "yarn application -list"
2. stop it - "yarn application -kill "
3. check driver and executors were down - "yarn application -list"


Thanks.

-Original Message-
From: Pradeep [mailto:pradeep.mi...@mail.com] 
Sent: Wednesday, August 03, 2016 10:48 AM
To: user@spark.apache.org
Subject: Stop Spark Streaming Jobs

Hi All,

My streaming job reads data from Kafka. The job is triggered and pushed to
background with nohup.

What are the recommended ways to stop job either on yarn-client or cluster
mode.

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



FW: [jupyter] newbie. apache spark python3 'Jupyter' data frame problem with auto completion and accessing documentation

2016-08-02 Thread Andy Davidson
FYI

From:   on behalf of Thomas Kluyver

Reply-To:  
Date:  Tuesday, August 2, 2016 at 3:26 AM
To:  Project Jupyter 
Subject:  Re: [jupyter] newbie. apache spark python3 'Jupyter' data frame
problem with auto completion and accessing documentation

> Hi Andy,
> 
> On 1 August 2016 at 22:46, Andy Davidson 
> wrote:
>> I wonder if the auto completion problem has to do with the way code is
>> typically written in Spark, i.e. by creating chains of function calls? Each
>> function returns a data frame.
>> 
>> hashTagsDF.groupBy("tag").agg({"tag": "count"}).orderBy("count(tag)",
>> ascending=False).show()
> 
> At the moment, tab completion isn't smart enough to know what the result of
> calling a function or method is, so it won't complete anything after the first
> groupBy there. This is something we hope to improve for IPython 6.
> 
> Thomas
> 
> -- 
> You received this message because you are subscribed to the Google Groups
> "Project Jupyter" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to jupyter+unsubscr...@googlegroups.com.
> To post to this group, send email to jupy...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/jupyter/CAOvn4qiGbH_m-TqsSKA6ou-DenwJXJ9Z%2B
> xNXSQzV2u7Q5-6HhA%40mail.gmail.com
>  BxNXSQzV2u7Q5-6HhA%40mail.gmail.com?utm_medium=email_source=footer> .
> For more options, visit https://groups.google.com/d/optout.




Job cannot terminate in Spark 2.0 on Yarn

2016-08-02 Thread Liangzhao Zeng
Hi,


I migrated my code from Spark 1.6 to 2.0. It finishes the last stage (and
the result is correct) but then gets the following errors and starts over.


Any idea what is happening?


16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus
has already stopped! Dropping event
SparkListenerExecutorMetricsUpdate(2,WrappedArray())
16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus
has already stopped! Dropping event
SparkListenerExecutorMetricsUpdate(115,WrappedArray())
16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus
has already stopped! Dropping event
SparkListenerExecutorMetricsUpdate(70,WrappedArray())
16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending
requests, but found none.
16/08/02 16:59:33 WARN netty.Dispatcher: Message
RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find
MapOutputTracker.



Cheers,


LZ


Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Solution:
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", "...") 
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", "...") 

Got this solution from a cloudera lady. Thanks Neerja.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417p27452.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Extracting key word from a textual column

2016-08-02 Thread Sonal Goyal
Hi Mich,

It seems like an entity resolution problem - looking at different
representations of an entity - SAINSBURY in this case and matching them all
together. How dirty is your data in the description - are there stop words
like SACAT/SMKT etc you can strip off and get the base retailer entity ?
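
A sketch of that stripping idea, using only the tokens mentioned here (the list is
illustrative; real data would need a fuller dictionary and probably fuzzy matching):

import org.apache.spark.sql.functions.{regexp_replace, trim, upper}

// sketch: remove a few assumed noise tokens before grouping on the cleaned text
val cleaned = ll_18740868.withColumn(
  "retailer_base",
  trim(regexp_replace(upper($"transactiondescription"), "\\b(SACAT|SMKT|S/MKT)\\b", ""))
)
cleaned.select($"transactiondescription", $"retailer_base").show(5, false)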

Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015






On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
wrote:

> Thanks.
>
> I believe there is some catalog of companies that I can get and store it
> in a table and math the company name to transactiondesciption column.
>
> That catalog should have sectors in it. For example company XYZ is under
> Grocers etc which will make search and grouping much easier.
>
> I believe Spark can do it, though I am generally interested on alternative
> ideas.
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 16:26, Yong Zhang  wrote:
>
>> Well, if you still want to use windows function for your logic, then you
>> need to derive a new column out, like "catalog", and use it as part of
>> grouping logic.
>>
>>
>> Maybe you can use regex for deriving out this new column. The
>> implementation needs to depend on your data in "transactiondescription",
>> and regex gives you the most powerful way to handle your data.
>>
>>
>> This is really not a Spark question, but how to you process your logic
>> based on the data given.
>>
>>
>> Yong
>>
>>
>> --
>> *From:* Mich Talebzadeh 
>> *Sent:* Tuesday, August 2, 2016 10:00 AM
>> *To:* user @spark
>> *Subject:* Extracting key word from a textual column
>>
>> Hi,
>>
>> Need some ideas.
>>
>> *Summary:*
>>
>> I am working on a tool to slice and dice the amount of money I have spent
>> so far (meaning the whole data sample) on a given retailer so I have a
>> better idea of where I am wasting the money
>>
>> *Approach*
>>
>> Downloaded my bank statements from a given account in csv format from
>> inception till end of July. Read the data and stored it in ORC table.
>>
>> I am interested for all bills that I paid using Debit Card (
>> transactiontype = "DEB") that comes out the account directly.
>> Transactiontype is the three character code lookup that I download as well.
>>
>> scala> ll_18740868.printSchema
>> root
>>  |-- transactiondate: date (nullable = true)
>>  |-- transactiontype: string (nullable = true)
>>  |-- sortcode: string (nullable = true)
>>  |-- accountnumber: string (nullable = true)
>>  |-- transactiondescription: string (nullable = true)
>>  |-- debitamount: double (nullable = true)
>>  |-- creditamount: double (nullable = true)
>>  |-- balance: double (nullable = true)
>>
>> The important fields are transactiondate, transactiontype,
>> transactiondescription and debitamount
>>
>> So using analytics. windowing I can do all sorts of things. For example
>> this one gives me the last time I spent money on retailer XYZ and the amount
>>
>> SELECT *
>> FROM (
>>   select transactiondate, transactiondescription, debitamount
>>   , rank() over (order by transactiondate desc) AS rank
>>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>>  ) tmp
>> where rank <= 1
>>
>> And its equivalent using Windowing in FP
>>
>> import org.apache.spark.sql.expressions.Window
>> val wSpec =
>> Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
>> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
>> rank().over(wSpec).as("rank")).filter($"rank"===1).show
>>
>>
>> +---+--++
>> |transactiondate|transactiondescription|rank|
>> +---+--++
>> | 2015-12-15|  XYZ LTD CD 4636 |   1|
>> +---+--++
>>
>> So far so good. But if I want to find all I spent on each retailer, then
>> it gets trickier as a retailer appears like below in the column
>> transactiondescription:
>>
>>
>> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
>> 

Re: Extracting key word from a textual column

2016-08-02 Thread Mich Talebzadeh
Thanks.

I believe there is some catalog of companies that I can get and store in
a table, and match the company name to the transactiondescription column.

That catalog should have sectors in it. For example, company XYZ is under
Grocers, etc., which will make searching and grouping much easier.

I believe Spark can do it, though I am generally interested in alternative
ideas.
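
A sketch of how that lookup might work once such a catalog exists (the table and
its columns are assumptions, not something from this thread):

import org.apache.spark.sql.functions.{sum, upper}

// sketch: a hypothetical catalog table with (companyname, sector), matched by substring
val catalog = spark.table("accounts.company_catalog")
val spendBySector = ll_18740868
  .join(catalog, upper($"transactiondescription").contains(upper($"companyname")))
  .groupBy($"companyname", $"sector")
  .agg(sum($"debitamount").as("spent"))
  .orderBy($"spent".desc)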





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 16:26, Yong Zhang  wrote:

> Well, if you still want to use windows function for your logic, then you
> need to derive a new column out, like "catalog", and use it as part of
> grouping logic.
>
>
> Maybe you can use regex for deriving out this new column. The
> implementation needs to depend on your data in "transactiondescription",
> and regex gives you the most powerful way to handle your data.
>
>
> This is really not a Spark question, but how to you process your logic
> based on the data given.
>
>
> Yong
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* Tuesday, August 2, 2016 10:00 AM
> *To:* user @spark
> *Subject:* Extracting key word from a textual column
>
> Hi,
>
> Need some ideas.
>
> *Summary:*
>
> I am working on a tool to slice and dice the amount of money I have spent
> so far (meaning the whole data sample) on a given retailer so I have a
> better idea of where I am wasting the money
>
> *Approach*
>
> Downloaded my bank statements from a given account in csv format from
> inception till end of July. Read the data and stored it in ORC table.
>
> I am interested in all bills that I paid using Debit Card (
> transactiontype = "DEB") that come out of the account directly.
> Transactiontype is the three-character lookup code that I downloaded as well.
>
> scala> ll_18740868.printSchema
> root
>  |-- transactiondate: date (nullable = true)
>  |-- transactiontype: string (nullable = true)
>  |-- sortcode: string (nullable = true)
>  |-- accountnumber: string (nullable = true)
>  |-- transactiondescription: string (nullable = true)
>  |-- debitamount: double (nullable = true)
>  |-- creditamount: double (nullable = true)
>  |-- balance: double (nullable = true)
>
> The important fields are transactiondate, transactiontype,
> transactiondescription and debitamount
>
> So using analytic windowing functions I can do all sorts of things. For example
> this one gives me the last time I spent money on retailer XYZ and the amount
>
> SELECT *
> FROM (
>   select transactiondate, transactiondescription, debitamount
>   , rank() over (order by transactiondate desc) AS rank
>   from accounts.ll_18740868 where transactiondescription like '%XYZ%'
>  ) tmp
> where rank <= 1
>
> And its equivalent using Windowing in FP
>
> import org.apache.spark.sql.expressions.Window
> val wSpec =
> Window.partitionBy("transactiontype").orderBy(desc("transactiondate"))
> ll_18740868.filter(col("transactiondescription").contains("XYZ")).select($"transactiondate",$"transactiondescription",
> rank().over(wSpec).as("rank")).filter($"rank"===1).show
>
>
> +---+--++
> |transactiondate|transactiondescription|rank|
> +---+--++
> | 2015-12-15|  XYZ LTD CD 4636 |   1|
> +---+--++
>
> So far so good. But if I want to find all I spent on each retailer, then
> it gets trickier as a retailer appears like below in the column
> transactiondescription:
>
>
> ll_18740868.where($"transactiondescription".contains("SAINSBURY")).select($"transactiondescription").show(5)
> +--+
> |transactiondescription|
> +--+
> |  SAINSBURYS SMKT C...|
> |  SACAT SAINSBURYS ...|
> |  SAINSBURY'S SMKT ...|
> |  SAINSBURYS S/MKT ...|
> |  SACAT SAINSBURYS ...|
> +--+
>
> If I look at them I know they all belong to SAINSBURYS (food retailer). I
> have done some crude grouping and it works somehow
>
> //define UDF here to handle substring
> val SubstrUDF = udf { (s: String, start: Int, end: Int) =>
> s.substring(start, end) }
> var cutoff = "CD"  // currently used in the statement
> val wSpec2 =
> Window.partitionBy(SubstrUDF($"transactiondescription",lit(0),instr($"transactiondescription",
> cutoff)-1))
> ll_18740868.where($"transactiontype" === "DEB" &&
> 

Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread freedafeng
Anyone, please? I believe many of us are using Spark 1.6 or higher with
S3...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-error-tp27417p27451.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark 1.6.0 read s3 files error.

2016-08-02 Thread Andy Davidson
Hi Freedafeng

I have been reading and writing to S3 using Spark 1.6.x without any
problems. Can you post a little code example and any error messages?

Andy

From:  freedafeng 
Date:  Tuesday, August 2, 2016 at 9:26 AM
To:  "user @spark" 
Subject:  Re: spark 1.6.0 read s3 files error.

> Any one, please? I believe many of us are using spark 1.6 or higher with
> s3... 
> 
> 
> 
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-0-read-s3-files-
> error-tp27417p27451.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 




Re: Spark GraphFrames

2016-08-02 Thread Denny Lee
Hi Divya,

Here's a blog post concerning On-Time Flight Performance with GraphFrames:
https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html

It also includes a Databricks notebook that has the code in it.
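
For a flavour of the API, here is a minimal sketch (the package coordinates and the
toy data are assumptions, not taken from the blog post):

// Start the shell with the graphframes package on the classpath, e.g.
//   spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
// (adjust the version to your Spark/Scala build)
import org.graphframes.GraphFrame

val vertices = sqlContext.createDataFrame(Seq(
  ("JFK", "New York"), ("SFO", "San Francisco"), ("SEA", "Seattle")
)).toDF("id", "city")                    // GraphFrames expects an "id" column

val edges = sqlContext.createDataFrame(Seq(
  ("JFK", "SFO", 25), ("SFO", "SEA", 10), ("SEA", "JFK", 5)
)).toDF("src", "dst", "delay")           // and "src"/"dst" columns on the edges

val g = GraphFrame(vertices, edges)
g.inDegrees.show()                       // e.g. how many routes arrive at each airport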

HTH!
Denny


On Tue, Aug 2, 2016 at 1:16 AM Kazuaki Ishizaki  wrote:

> Sorry
> Please ignore this mail. Sorry for the misinterpretation of GraphFrame in
> Spark. I thought it was the Frame Graph used by a profiling tool.
>
> Kazuaki Ishizaki,
>
>
>
> From:Kazuaki Ishizaki/Japan/IBM@IBMJP
> To:Divya Gehlot 
> Cc:"user @spark" 
> Date:2016/08/02 17:06
> Subject:Re: Spark GraphFrames
> --
>
>
>
> Hi,
> Kay wrote a procedure to use GraphFrames with Spark.
> *https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4*
> 
>
> Kazuaki Ishizaki
>
>
>
> From:Divya Gehlot 
> To:"user @spark" 
> Date:2016/08/02 14:52
> Subject:Spark GraphFrames
> --
>
>
>
> Hi,
>
> Has anybody worked with GraphFrames?
> Please let me know, as I need to know the real-case scenarios where it can be
> used.
>
>
> Thanks,
> Divya
>
>
>


Re: What are using Spark for

2016-08-02 Thread Karthik Ramakrishnan
We used Storm for ETL; now we are thinking Spark might be advantageous
since some ML is also coming our way.

- Karthik

On Tue, Aug 2, 2016 at 1:10 PM, Rohit L  wrote:

> Does anyone use Spark for ETL?
>
> On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal  wrote:
>
>> Hi Rohit,
>>
>> You can check the powered by spark page for some real usage of Spark.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
>>
>>
>> On Tuesday, August 2, 2016, Rohit L  wrote:
>>
>>> Hi Everyone,
>>>
>>>   I want to know the real world uses cases for which Spark is used
>>> and hence can you please share for what purpose you are using Apache Spark
>>> in your project?
>>>
>>> --
>>> Rohit
>>>
>>
>>
>> --
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies 
>> Reifier at Strata Hadoop World
>> 
>> Reifier at Spark Summit 2015
>> 
>>
>> 
>>
>>
>>
>>
>


Re: What are using Spark for

2016-08-02 Thread Rohit L
Does anyone use Spark for ETL?

On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal  wrote:

> Hi Rohit,
>
> You can check the powered by spark page for some real usage of Spark.
>
> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
>
>
> On Tuesday, August 2, 2016, Rohit L  wrote:
>
>> Hi Everyone,
>>
>>   I want to know the real world uses cases for which Spark is used
>> and hence can you please share for what purpose you are using Apache Spark
>> in your project?
>>
>> --
>> Rohit
>>
>
>
> --
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
>


Re: What are using Spark for

2016-08-02 Thread Deepak Sharma
Yes. I am using Spark for ETL and I am sure there are a lot of other companies
who are using Spark for ETL.

Thanks
Deepak

On 2 Aug 2016 11:40 pm, "Rohit L"  wrote:

> Does anyone use Spark for ETL?
>
> On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal  wrote:
>
>> Hi Rohit,
>>
>> You can check the powered by spark page for some real usage of Spark.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
>>
>>
>> On Tuesday, August 2, 2016, Rohit L  wrote:
>>
>>> Hi Everyone,
>>>
>>>   I want to know the real world uses cases for which Spark is used
>>> and hence can you please share for what purpose you are using Apache Spark
>>> in your project?
>>>
>>> --
>>> Rohit
>>>
>>
>>
>> --
>> Best Regards,
>> Sonal
>> Founder, Nube Technologies 
>> Reifier at Strata Hadoop World
>> 
>> Reifier at Spark Summit 2015
>> 
>>
>> 
>>
>>
>>
>>
>


Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread Ted Yu
Which hadoop version are you using ?

Can you show snippet of your code ?

Thanks

On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng 
wrote:

> Hi,
>
>
> I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result is 
> correct) but get following errors then start over.
>
>
> Any idea on what happen?
>
>
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending requests, 
> but found none.
> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
> MapOutputTracker.
>
>
>
> Cheers,
>
>
> LZ
>
>


Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread dhruve ashar
Hi LZ,

Getting those error messages in the logs is normal behavior. When the job
completes, it shuts down the SparkListenerBus as there's no need to relay
any Spark events to the registered listeners. So trying to add events from
executors which are yet to shut down logs the error message, because the
event listener bus has stopped.

"I migrate my code to Spark 2.0 from 1.6. It finish last stage (and
result is correct) but get following errors then start over."


Can you elaborate on what do you mean by start over?


-Dhruve




On Tue, Aug 2, 2016 at 1:01 PM, Ted Yu  wrote:

> Which hadoop version are you using ?
>
> Can you show snippet of your code ?
>
> Thanks
>
> On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng 
> wrote:
>
>> Hi,
>>
>>
>> I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result is 
>> correct) but get following errors then start over.
>>
>>
>> Any idea on what happen?
>>
>>
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
>> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending 
>> requests, but found none.
>> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
>> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
>> MapOutputTracker.
>>
>>
>>
>> Cheers,
>>
>>
>> LZ
>>
>>
>


-- 
-Dhruve Ashar


Re: Spark 2.0 History Server Storage

2016-08-02 Thread Andrei Ivanov
OK, answering myself - this is broken since 1.6.2 by SPARK-13845


On Tue, Aug 2, 2016 at 12:10 AM, Andrei Ivanov  wrote:

> Hi all,
>
> I've just tried upgrading Spark to 2.0 and so far it looks generally good.
>
> But there is at least one issue I see right away - job histories are
> missing storage information (persisted RDDs).
> This info is also missing from pre upgrade jobs.
>
> Does anyone have a clue what can be wrong?
>
> Thanks, Andrei Ivanov.
>


Re: Spark 2.0 History Server Storage

2016-08-02 Thread Andrei Ivanov
SPARK-16859 submitted


On Tue, Aug 2, 2016 at 9:07 PM, Andrei Ivanov  wrote:

> OK, answering myself - this is broken since 1.6.2 by SPARK-13845
> 
>
> On Tue, Aug 2, 2016 at 12:10 AM, Andrei Ivanov 
> wrote:
>
>> Hi all,
>>
>> I've just tried upgrading Spark to 2.0 and so far it looks generally good.
>>
>> But there is at least one issue I see right away - job histories are
>> missing storage information (persisted RDDs).
>> This info is also missing from pre upgrade jobs.
>>
>> Does anyone have a clue what can be wrong?
>>
>> Thanks, Andrei Ivanov.
>>
>
>


Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread Liangzhao Zeng
It tries to execute the job again, from the first stage.

Sent from my iPhone

> On Aug 2, 2016, at 11:24 AM, dhruve ashar  wrote:
> 
> Hi LZ,
> 
> Getting those error messages in logs is normal behavior. When the job 
> completes, it shuts down the SparkListenerBus as there's no need of relaying 
> any spark events to the interested registered listeners. So trying to add 
> events from executors which are yet to shutdown, logs the error message 
> because the event listener bus has stopped.
> 
> "I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result is 
> correct) but get following errors then start over." 
> 
> Can you elaborate on what do you mean by start over?
> 
> -Dhruve
> 
> 
> 
>> On Tue, Aug 2, 2016 at 1:01 PM, Ted Yu  wrote:
>> Which hadoop version are you using ?
>> 
>> Can you show snippet of your code ?
>> 
>> Thanks
>> 
>>> On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng  
>>> wrote:
>>> Hi, 
>>> 
>>> I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result 
>>> is correct) but get following errors then start over. 
>>> 
>>> Any idea on what happen?
>>> 
>>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>>> already stopped! Dropping event 
>>> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
>>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>>> already stopped! Dropping event 
>>> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
>>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>>> already stopped! Dropping event 
>>> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
>>> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending 
>>> requests, but found none.
>>> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
>>> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
>>> MapOutputTracker.
>>> 
>>> 
>>> Cheers,
>>> 
>>> LZ
> 
> 
> 
> -- 
> -Dhruve Ashar
> 


Re: What are using Spark for

2016-08-02 Thread Mich Talebzadeh
Hi,

If I may say, if you spend some time going through this mailing list and
see the variety of topics that users are discussing, then you will get
plenty of ideas about Spark applications in real life.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 19:17, Karthik Ramakrishnan <
karthik.ramakrishna...@gmail.com> wrote:

> We used Storm for ETL, now currently thinking Spark might be advantageous
> since some ML also is coming our way.
>
> - Karthik
>
> On Tue, Aug 2, 2016 at 1:10 PM, Rohit L  wrote:
>
>> Does anyone use Spark for ETL?
>>
>> On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal 
>> wrote:
>>
>>> Hi Rohit,
>>>
>>> You can check the powered by spark page for some real usage of Spark.
>>>
>>> https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
>>>
>>>
>>> On Tuesday, August 2, 2016, Rohit L  wrote:
>>>
 Hi Everyone,

   I want to know the real world uses cases for which Spark is used
 and hence can you please share for what purpose you are using Apache Spark
 in your project?

 --
 Rohit

>>>
>>>
>>> --
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies 
>>> Reifier at Strata Hadoop World
>>> 
>>> Reifier at Spark Summit 2015
>>> 
>>>
>>> 
>>>
>>>
>>>
>>>
>>
>
>
>
>


Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread dhruve ashar
Can you provide additional logs.

On Tue, Aug 2, 2016 at 2:13 PM, Liangzhao Zeng 
wrote:

> It is 2.6 and the code is very simple. I load a data file from HDFS to create
> an RDD, then take some samples.
>
>
> Thanks
>
> Sent from my iPhone
>
> On Aug 2, 2016, at 11:01 AM, Ted Yu  wrote:
>
> Which hadoop version are you using ?
>
> Can you show snippet of your code ?
>
> Thanks
>
> On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng 
> wrote:
>
>> Hi,
>>
>>
>> I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result is 
>> correct) but get following errors then start over.
>>
>>
>> Any idea on what happen?
>>
>>
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
>> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending 
>> requests, but found none.
>> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
>> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
>> MapOutputTracker.
>>
>>
>>
>> Cheers,
>>
>>
>> LZ
>>
>>
>


-- 
-Dhruve Ashar


Re: What are using Spark for

2016-08-02 Thread Daniel Siegmann
Yes, you can use Spark for ETL, as well as feature engineering, training,
and scoring.

~Daniel Siegmann

On Tue, Aug 2, 2016 at 3:29 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> If I may say, if you spend  sometime going through this mailing list in
> this forum and see the variety of topics that users are discussing, then
> you may get plenty of ideas about Spark application in real life..
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 19:17, Karthik Ramakrishnan <
> karthik.ramakrishna...@gmail.com> wrote:
>
>> We used Storm for ETL, now currently thinking Spark might be advantageous
>> since some ML also is coming our way.
>>
>> - Karthik
>>
>> On Tue, Aug 2, 2016 at 1:10 PM, Rohit L  wrote:
>>
>>> Does anyone use Spark for ETL?
>>>
>>> On Tue, Aug 2, 2016 at 1:24 PM, Sonal Goyal 
>>> wrote:
>>>
 Hi Rohit,

 You can check the powered by spark page for some real usage of Spark.

 https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark


 On Tuesday, August 2, 2016, Rohit L  wrote:

> Hi Everyone,
>
>   I want to know the real world uses cases for which Spark is used
> and hence can you please share for what purpose you are using Apache Spark
> in your project?
>
> --
> Rohit
>


 --
 Best Regards,
 Sonal
 Founder, Nube Technologies 
 Reifier at Strata Hadoop World
 
 Reifier at Spark Summit 2015
 

 




>>>
>>
>>
>>
>>
>


saving data frame to optimize joins at a later time

2016-08-02 Thread Cesar
Hi all:

I wonder if there is a way to save a table in order to optimize joins at a
later time.

For example if I do something like:

val df = anotherDF.repartition($"id")  // some data frame, repartitioned by the join key
df.registerTempTable("tableAlias")

hiveContext.sql(
  "INSERT INTO whse.someTable SELECT * FROM tableAlias"
)

Will the partition information ("id") be stored in whse.someTable, such
that when querying that table in a second Spark job the information
will be used for optimizing joins, for example?

If this approach does not work, can you suggest one that works?
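
For what it's worth, a plain INSERT INTO will not by itself record the repartitioning;
the rows land in the table but the layout metadata is gone. One option worth trying in
Spark 2.0 (a sketch only, assuming a data-source table format such as Parquet rather
than a Hive SerDe table, and an arbitrary bucket count) is to persist the layout with
bucketBy, so that later jobs reading the table can exploit the co-location of equal keys:

anotherDF
  .write
  .bucketBy(48, "id")       // 48 buckets is an arbitrary choice
  .sortBy("id")
  .saveAsTable("whse.someTable_bucketed")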


Thanks
-- 
Cesar Flores


Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread Liangzhao Zeng
It is 2.6 and the code is very simple. I load a data file from HDFS to create an RDD,
then take some samples.


Thanks 

Sent from my iPhone

> On Aug 2, 2016, at 11:01 AM, Ted Yu  wrote:
> 
> Which hadoop version are you using ?
> 
> Can you show snippet of your code ?
> 
> Thanks
> 
>> On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng  
>> wrote:
>> Hi, 
>> 
>> I migrate my code to Spark 2.0 from 1.6. It finish last stage (and result is 
>> correct) but get following errors then start over. 
>> 
>> Any idea on what happen?
>> 
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
>> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
>> already stopped! Dropping event 
>> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
>> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending 
>> requests, but found none.
>> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
>> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
>> MapOutputTracker.
>> 
>> 
>> Cheers,
>> 
>> LZ
> 


In 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?

2016-08-02 Thread pgb
I'm interested in learning if it's possible to grab the result set from a
query run on an external database as opposed to grabbing the full table and
manipulating it later. The base code I'm working with is below (using Spark
2.0.0):

```
from pyspark.sql import SparkSession

# build (or reuse) the session that provides spark.read
spark = SparkSession.builder.getOrCreate()

df = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost:port")\
    .option("dbtable", "schema.tablename")\
    .option("user", "username")\
    .option("password", "password")\
    .load()
```
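
For reference, the JDBC source accepts anything that is valid in a FROM clause as the
dbtable option, so a parenthesized subquery is pushed to the database instead of
pulling the whole table. A sketch in Scala (the query and URL are placeholders; the
same options work from PySpark):

val pushedDown = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/schema")   // placeholder URL
  .option("dbtable", "(SELECT col_a, col_b FROM schema.tablename WHERE col_a > 100) AS t")
  .option("user", "username")
  .option("password", "password")
  .load()   // only the subquery's result set comes back over JDBC

Simple filters applied to the loaded DataFrame are also pushed down as WHERE clauses
by the JDBC source, so a df.filter(...) on the full table may already avoid pulling
every row.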

Many thanks,
Pat



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/In-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database-rather-than-grab-the-whole-table-tp27453.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Extracting key word from a textual column

2016-08-02 Thread Mich Talebzadeh
Hi,

I decided to create a catalog table in Hive, ORC and transactional. That
table has two columns of value:


   1. transactiondescription === account_table.transactiondescription
   2. hashtag: a String column created from a semi-automated process of
   deriving it from account_table.transactiondescription

Once the process of populating the catalog table is complete, we just
need to create a new DF based on a join between the catalog table and the
account_table. The join will use the hashtag in the catalog table to loop over
the debit column in account_table for a given hashtag. That is pretty fast, as
pattern matching is quite intensive in any application and database in
real time.
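
As an illustration of that join, a sketch only (the catalog table name is assumed,
and, as mentioned in the P.S. further down, the catalog could equally be loaded over
JDBC):

import org.apache.spark.sql.functions._

val catalog  = sqlContext.table("accounts.catalog")   // assumed: (transactiondescription, hashtag)
val accounts = sqlContext.table("accounts.ll_18740868")

val spendByTag = accounts
  .join(catalog, "transactiondescription")            // equi-join on the description
  .groupBy($"hashtag")
  .agg(sum($"debitamount").as("total_debit"))
  .orderBy($"total_debit".desc)

spendByTag.show()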

So one can build up the catalog table over time as a reference table. I am
sure such tables exist in the commercial world.

Anyway, after getting the results out I know how I am wasting my money on
different things, especially on clothing etc. :)


HTH

P.S. There is also an issue with Spark not being able to read data from
Hive transactional tables that have not been compacted yet; Spark just
crashes. If these tables need to be updated regularly, as the catalog table is, and
they are pretty small, one might maintain them in an RDBMS and read them
once through JDBC into a DataFrame in Spark before doing analytics.


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 17:56, Sonal Goyal  wrote:

> Hi Mich,
>
> It seems like an entity resolution problem - looking at different
> representations of an entity - SAINSBURY in this case and matching them all
> together. How dirty is your data in the description - are there stop words
> like SACAT/SMKT etc you can strip off and get the base retailer entity ?
>
> Best Regards,
> Sonal
> Founder, Nube Technologies 
> Reifier at Strata Hadoop World
> 
> Reifier at Spark Summit 2015
> 
>
> 
>
>
>
> On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh  > wrote:
>
>> Thanks.
>>
>> I believe there is some catalog of companies that I can get and store
>> in a table and match the company names to the transactiondescription column.
>>
>> That catalog should have sectors in it. For example company XYZ is under
>> Grocers etc which will make search and grouping much easier.
>>
>> I believe Spark can do it, though I am generally interested in
>> alternative ideas.
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 2 August 2016 at 16:26, Yong Zhang  wrote:
>>
>>> Well, if you still want to use windows function for your logic, then you
>>> need to derive a new column out, like "catalog", and use it as part of
>>> grouping logic.
>>>
>>>
>>> Maybe you can use regex for deriving out this new column. The
>>> implementation needs to depend on your data in "transactiondescription",
>>> and regex gives you the most powerful way to handle your data.
>>>
>>>
>>> This is really not a Spark question, but a question of how you process your logic
>>> based on the data given.
>>>
>>>
>>> Yong
>>>
>>>
>>> --
>>> *From:* Mich Talebzadeh 
>>> *Sent:* Tuesday, August 2, 2016 10:00 AM
>>> *To:* user @spark
>>> *Subject:* Extracting key word from a textual column
>>>
>>> Hi,
>>>
>>> Need some ideas.
>>>
>>> *Summary:*
>>>
>>> I am working on a tool to slice and dice the amount of money I have
>>> spent so far (meaning the whole data sample) on a given retailer so I have
>>> a better idea of where I am wasting the money
>>>
>>> *Approach*
>>>
>>> Downloaded my bank statements from a given account in csv format from
>>> inception till end of July. Read the data and stored it in ORC table.
>>>
>>> I am interested for 

Re: decribe function limit of columns

2016-08-02 Thread janardhan shetty
If you are referring to limiting the # of columns, you can select the columns
and describe them.
df.select("col1", "col2").describe().show()

On Tue, Aug 2, 2016 at 6:39 AM, pseudo oduesp  wrote:

> Hi
> In Spark 1.5.0 I used the describe function with more than 100 columns.
> Can someone tell me if any limit exists now?
>
> thanks
>
>


[2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Dragisa Krsmanovic
I am trying to use mapPartitions on DataFrame.

Example:

import spark.implicits._
val df: DataFrame = Seq((1,"one"), (2, "two")).toDF("id", "name")
df.mapPartitions(_.take(1))

I am getting:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int,
String, etc) and Product types (case classes) are supported by importing
spark.implicits._  Support for serializing other types will be added in
future releases.

Since DataFrame is Dataset[Row], I was expecting encoder for Row to be
there.

What's wrong with my code ?
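
One workaround often suggested for 2.0.0 (a sketch, untested here) is to supply the Row
encoder explicitly, since no implicit Encoder[Row] is in scope:

import org.apache.spark.sql.catalyst.encoders.RowEncoder

// mapPartitions takes an implicit Encoder[U]; pass one built from the frame's schema
val result = df.mapPartitions(_.take(1))(RowEncoder(df.schema))
result.show()

Alternatively, converting to a typed Dataset first (df.as[SomeCaseClass]) lets the
implicits in spark.implicits._ provide the encoder.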


-- 

Dragiša Krsmanović | Platform Engineer | Ticketfly

dragi...@ticketfly.com

@ticketfly  | ticketfly.com/blog |
facebook.com/ticketfly


Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi everyone,
I setup STS and use Zeppelin to query data through JDBC connection.
A problem we are facing is that users usually forget to put a limit in the query,
so it hangs the cluster.

SELECT * FROM tableA;

Is there any way to configure a limit by default?


Regards,
Chanh
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Mich Talebzadeh
This is a classic problem on any RDBMS.

Set a limit on the number of rows returned, say a maximum of 50K rows,
through JDBC.

What is your JDBC connection going to? Meaning, which RDBMS, if any?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 08:41, Chanh Le  wrote:

> Hi everyone,
> I setup STS and use Zeppelin to query data through JDBC connection.
> A problem we are facing is users usually forget to put limit in the query
> so it causes hang the cluster.
>
> SELECT * FROM tableA;
>
> Is there anyway to config the limit by default ?
>
>
> Regards,
> Chanh


Re: Spark GraphFrames

2016-08-02 Thread Kazuaki Ishizaki
Hi,
Kay wrote a procedure to use GraphFrames with Spark.
https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4

Kazuaki Ishizaki



From:   Divya Gehlot 
To: "user @spark" 
Date:   2016/08/02 14:52
Subject:Spark GraphFrames



Hi,

Has anybody worked with GraphFrames?
Please let me know, as I need to know the real-case scenarios where it can be
used.


Thanks,
Divya 




Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Chanh Le
Hi Mich,
I use Spark Thrift Server; basically it acts like Hive.

I see that there is property in Hive.

> hive.limit.optimize.fetch.max
> Default Value: 5
> Added In: Hive 0.8.0
> Maximum number of rows allowed for a smaller subset of data for simple LIMIT, 
> if it is a fetch query. Insert queries are not restricted by this limit.

Is that related to the problem?




> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh  wrote:
> 
> This is a classic problem on any RDBMS
> 
> Set the limit on the number of rows returned like maximum of 50K rows through 
> JDBC
> 
> What is your JDBC connection going to? Meaning which RDBMS if any?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 2 August 2016 at 08:41, Chanh Le  > wrote:
> Hi everyone,
> I setup STS and use Zeppelin to query data through JDBC connection.
> A problem we are facing is users usually forget to put limit in the query so 
> it causes hang the cluster.
> 
> SELECT * FROM tableA;
> 
> Is there anyway to config the limit by default ?
> 
> 
> Regards,
> Chanh
> 



Re: Spark GraphFrames

2016-08-02 Thread Kazuaki Ishizaki
Sorry
Please ignore this mail. Sorry for the misinterpretation of GraphFrame in
Spark. I thought it was the Frame Graph used by a profiling tool.

Kazuaki Ishizaki,



From:   Kazuaki Ishizaki/Japan/IBM@IBMJP
To: Divya Gehlot 
Cc: "user @spark" 
Date:   2016/08/02 17:06
Subject:Re: Spark GraphFrames



Hi,
Kay wrote a procedure to use GraphFrames with Spark.
https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4

Kazuaki Ishizaki



From:Divya Gehlot 
To:"user @spark" 
Date:2016/08/02 14:52
Subject:Spark GraphFrames



Hi,

Has anybody worked with GraphFrames?
Please let me know, as I need to know the real-case scenarios where it can be
used.


Thanks,
Divya 





Re: The equivalent for INSTR in Spark FP

2016-08-02 Thread Mich Talebzadeh
No thinking on my part!!!

rs.select(mySubstr($"transactiondescription", lit(1),
instr($"transactiondescription", "CD"))).show(2)
+--+
|UDF(transactiondescription,1,instr(transactiondescription,CD))|
+--+
|   VERSEAS TRANSACTI C|
|   XYZ.COM 80...|
+--+
only showing top 2 rows

Let me test it.
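
For what it's worth, a UDF may not be strictly necessary here: Column.substr also
accepts Column arguments, so the cut point can come straight from instr(). A sketch
(untested against this data, and rows without "CD" would still need guarding):

import org.apache.spark.sql.functions.{instr, lit}

rs.select(
  $"transactiondescription"
    .substr(lit(1), instr($"transactiondescription", "CD") - 2)
    .as("retailer")
).show(2)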

Cheers



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 23:43, Mich Talebzadeh 
wrote:

> Thanks Jacek.
>
> It sounds like the issue is the position of the second variable in substring()
>
> This works
>
> scala> val wSpec2 =
> Window.partitionBy(substring($"transactiondescription",1,20))
> wSpec2: org.apache.spark.sql.expressions.WindowSpec =
> org.apache.spark.sql.expressions.WindowSpec@1a4eae2
>
> Using udf as suggested
>
> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
>  |  s.substring(start, end) }
> mySubstr: org.apache.spark.sql.UserDefinedFunction =
> UserDefinedFunction(,StringType,List(StringType, IntegerType,
> IntegerType))
>
>
> This was throwing error:
>
> val wSpec2 = Window.partitionBy(substring("transactiondescription",1,
> indexOf("transactiondescription",'CD')-2))
>
>
> So I tried using udf
>
> scala> val wSpec2 =
> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
> instr('s, "CD")))
>  | )
> :28: error: value select is not a member of
> org.apache.spark.sql.ColumnName
>  val wSpec2 =
> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
> instr('s, "CD")))
>
> Obviously I am not doing correctly :(
>
> cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 23:02, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Interesting...
>>
>> I'm tempted to think that the substring function should accept the columns
>> that hold the numbers for start and end. I'd love hearing people's
>> thought on this.
>>
>> For now, I'd say you need to define udf to do substring as follows:
>>
>> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
>> s.substring(start, end) }
>> mySubstr: org.apache.spark.sql.expressions.UserDefinedFunction =
>> UserDefinedFunction(,StringType,Some(List(StringType,
>> IntegerType, IntegerType)))
>>
>> scala> df.show
>> +---+
>> |  s|
>> +---+
>> |hello world|
>> +---+
>>
>> scala> df.select(mySubstr('s, lit(1), instr('s, "ll"))).show
>> +---+
>> |UDF(s, 1, instr(s, ll))|
>> +---+
>> | el|
>> +---+
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Aug 1, 2016 at 11:18 PM, Mich Talebzadeh
>>  wrote:
>> > Thanks Jacek,
>> >
>> > Do I have any other way of writing this with functional programming?
>> >
>> > select
>> >
>> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>> >
>> >
>> > Cheers,
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> arise
>> > from relying on this email's technical content is explicitly
>> disclaimed. The
>> > author will in no case be liable for any monetary damages arising from
>> such
>> > 

Re: What are using Spark for

2016-08-02 Thread Sonal Goyal
Hi Rohit,

You can check the powered by spark page for some real usage of Spark.

https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

On Tuesday, August 2, 2016, Rohit L  wrote:

> Hi Everyone,
>
>   I want to know the real-world use cases for which Spark is used and
> hence can you please share for what purpose you are using Apache Spark in
> your project?
>
> --
> Rohit
>


-- 
Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015





Re: Tuning level of Parallelism: Increase or decrease?

2016-08-02 Thread Sonal Goyal
Hi Jestin,

Which of your actions is the bottleneck? Is it group by, count or the join?
Or all of them? It may help to tune the most time-consuming task first.

On Monday, August 1, 2016, Nikolay Zhebet  wrote:

> Yes, Spark always tries to deliver the snippet of code to the data (not vice
> versa). But you should realize that if you try to run groupBy or join on
> a large dataset, then you always have to migrate temporary, locally grouped
> data from one worker node to another (it is a shuffle operation, as far as I
> know). At the end of all batch processes, you can fetch your grouped
> dataset. But under the hood you will see a lot of network connections between
> worker nodes, because all your 2TB of data was split into 128MB parts and
> written to different HDFS DataNodes.
>
> As an example: you analyze your workflow and realize that in most cases
> you group your data by date (yyyy-mm-dd). In this case you can save all the
> data for a day in one Region Server (if you use the Spark-on-HBase DataFrame).
> Then your "group by date" operation can be done on the local
> worker-node without shuffling your temporary data between other
> worker-nodes. Maybe this article can be useful:
> http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
>
> 2016-08-01 18:56 GMT+03:00 Jestin Ma  >:
>
>> Hi Nikolay, I'm looking at data locality improvements for Spark, and I
>> have conflicting sources on using YARN for Spark.
>>
>> Reynold said that Spark workers automatically take care of data locality
>> here:
>> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>>
>> However, I've read elsewhere (
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
>> that Spark on YARN increases data locality because YARN tries to place
>> tasks next to HDFS blocks.
>>
>> Can anyone verify/support one side or the other?
>>
>> Thank you,
>> Jestin
>>
>> On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet > > wrote:
>>
>>> Hi.
>>> Maybe you can help "data locality"..
>>> If you use groupBY and joins, than most likely you will see alot of
>>> network operations. This can be werry slow. You can try prepare, transform
>>> your information in that way, what can minimize transporting temporary
>>> information between worker-nodes.
>>>
>>> Try google in this way "Data locality in Hadoop"
>>>
>>>
>>> 2016-08-01 4:41 GMT+03:00 Jestin Ma >> >:
>>>
 It seems that the number of tasks being this large do not matter. Each
 task was set default by the HDFS as 128 MB (block size) which I've heard to
 be ok. I've tried tuning the block (task) size to be larger and smaller to
 no avail.

 I tried coalescing to 50 but that introduced large data skew and slowed
 down my job a lot.

 On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich > wrote:

> 15000 seems like a lot of tasks for that size. Test it out with a
> .coalesce(50) placed right after loading the data. It will probably either
> run faster or crash with out of memory errors.
>
> On Jul 29, 2016, at 9:02 AM, Jestin Ma  > wrote:
>
> I am processing ~2 TB of hdfs data using DataFrames. The size of a
> task is equal to the block size specified by hdfs, which happens to be 128
> MB, leading to about 15000 tasks.
>
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing groupBy, count, and an outer-join with another
> DataFrame of ~200 MB size (~80 MB cached but I don't need to cache it),
> then saving to disk.
>
> Right now it takes about 55 minutes, and I've been trying to tune it.
>
> I read on the Spark Tuning guide that:
> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>
> This means that I should have about 30-50 tasks instead of 15000, and
> each task would be much bigger in size. Is my understanding correct, and 
> is
> this suggested? I've read from difference sources to decrease or increase
> parallelism, or even keep it default.
>
> Thank you for your help,
> Jestin
>
>
>

>>>
>>
>

-- 
Best Regards,
Sonal
Founder, Nube Technologies 
Reifier at Strata Hadoop World 
Reifier at Spark Summit 2015





Re: The equivalent for INSTR in Spark FP

2016-08-02 Thread Mich Talebzadeh
it should be lit(0) :)

rs.select(mySubstr($"transactiondescription", lit(0),
instr($"transactiondescription", "CD"))).show(1)
+--+
|UDF(transactiondescription,0,instr(transactiondescription,CD))|
+--+
|  OVERSEAS TRANSACTI C|
+--+



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 08:52, Mich Talebzadeh 
wrote:

> No thinking on my part!!!
>
> rs.select(mySubstr($"transactiondescription", lit(1),
> instr($"transactiondescription", "CD"))).show(2)
> +--+
> |UDF(transactiondescription,1,instr(transactiondescription,CD))|
> +--+
> |   VERSEAS TRANSACTI C|
> |   XYZ.COM 80...|
> +--+
> only showing top 2 rows
>
> Let me test it.
>
> Cheers
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 23:43, Mich Talebzadeh 
> wrote:
>
>> Thanks Jacek.
>>
>> It sounds like the issue the position of the second variable in
>> substring()
>>
>> This works
>>
>> scala> val wSpec2 =
>> Window.partitionBy(substring($"transactiondescription",1,20))
>> wSpec2: org.apache.spark.sql.expressions.WindowSpec =
>> org.apache.spark.sql.expressions.WindowSpec@1a4eae2
>>
>> Using udf as suggested
>>
>> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
>>  |  s.substring(start, end) }
>> mySubstr: org.apache.spark.sql.UserDefinedFunction =
>> UserDefinedFunction(,StringType,List(StringType, IntegerType,
>> IntegerType))
>>
>>
>> This was throwing error:
>>
>> val wSpec2 = Window.partitionBy(substring("transactiondescription",1,
>> indexOf("transactiondescription",'CD')-2))
>>
>>
>> So I tried using udf
>>
>> scala> val wSpec2 =
>> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
>> instr('s, "CD")))
>>  | )
>> :28: error: value select is not a member of
>> org.apache.spark.sql.ColumnName
>>  val wSpec2 =
>> Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
>> instr('s, "CD")))
>>
>> Obviously I am not doing correctly :(
>>
>> cheers
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 23:02, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> Interesting...
>>>
>>> I'm temping to think that substring function should accept the columns
>>> that hold the numbers for start and end. I'd love hearing people's
>>> thought on this.
>>>
>>> For now, I'd say you need to define udf to do substring as follows:
>>>
>>> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
>>> s.substring(start, end) }
>>> mySubstr: org.apache.spark.sql.expressions.UserDefinedFunction =
>>> UserDefinedFunction(,StringType,Some(List(StringType,
>>> IntegerType, IntegerType)))
>>>
>>> scala> df.show
>>> +---+
>>> |  s|
>>> +---+
>>> |hello world|
>>> +---+
>>>
>>> scala> df.select(mySubstr('s, lit(1), instr('s, 

Re: Does it has a way to config limit in query on STS by default?

2016-08-02 Thread Mich Talebzadeh
OK

Try that

Another, more tedious, way is to create views in Hive based on the tables and put a
limit on those views.
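
For illustration, a sketch of that idea issued through the same SQL endpoint (the view
name and row cap are arbitrary; Hive and Spark SQL accept a LIMIT inside a view
definition):

spark.sql(
  """CREATE VIEW IF NOT EXISTS tableA_capped AS
    |SELECT * FROM tableA LIMIT 50000""".stripMargin)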

But try that parameter first and see if it does anything.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 2 August 2016 at 09:13, Chanh Le  wrote:

> Hi Mich,
> I use Spark Thrift Server basically it acts like Hive.
>
> I see that there is property in Hive.
>
> hive.limit.optimize.fetch.max
>
>- Default Value: 5
>- Added In: Hive 0.8.0
>
> Maximum number of rows allowed for a smaller subset of data for simple
> LIMIT, if it is a fetch query. Insert queries are not restricted by this
> limit.
>
>
> Is that related to the problem?
>
>
>
>
> On Aug 2, 2016, at 2:55 PM, Mich Talebzadeh 
> wrote:
>
> This is a classic problem on any RDBMS
>
> Set the limit on the number of rows returned like maximum of 50K rows
> through JDBC
>
> What is your JDBC connection going to? Meaning which RDBMS if any?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 2 August 2016 at 08:41, Chanh Le  wrote:
>
>> Hi everyone,
>> I setup STS and use Zeppelin to query data through JDBC connection.
>> A problem we are facing is users usually forget to put limit in the query
>> so it causes hang the cluster.
>>
>> SELECT * FROM tableA;
>>
>> Is there anyway to config the limit by default ?
>>
>>
>> Regards,
>> Chanh
>
>
>
>


Re: spark 2.0 readStream from a REST API

2016-08-02 Thread Ayoub Benali
Hello,

here is the code I am trying to run:


https://gist.github.com/ayoub-benali/a96163c711b4fce1bdddf16b911475f2

Thanks,
Ayoub.

2016-08-01 13:44 GMT+02:00 Jacek Laskowski :

> On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali
>  wrote:
>
> > the problem now is that when I consume the dataframe for example with
> count
> > I get the stack trace below.
>
> Mind sharing the entire pipeline?
>
> > I followed the implementation of TextSocketSourceProvider to implement my
> > data source and Text Socket source is used in the official documentation
> > here.
>
> Right. Completely forgot about the provider. Thanks for reminding me about
> it!
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Application not showing in Spark History

2016-08-02 Thread Rychnovsky, Dusan
Hi,


I am trying to launch my Spark application from within my Java application via 
the SparkSubmit class, like this:


List<String> args = new ArrayList<>();

args.add("--verbose");
args.add("--deploy-mode=cluster");
args.add("--master=yarn");
...


SparkSubmit.main(args.toArray(new String[args.size()]));



This works fine, with one catch - the application does not appear in Spark 
History after it's finished.


If, however, I run the application using `spark-submit.sh`, like this:



spark-submit \

  --verbose \

  --deploy-mode=cluster \

  --master=yarn \

  ...



the application appears in Spark History correctly.


What am I missing?


Also, is this a good way to launch a Spark application from within a Java 
application or is there a better way?

Thanks,

Dusan



Re: spark 2.0 readStream from a REST API

2016-08-02 Thread Ayoub Benali
Why is writeStream needed to consume the data?

When I tried it I got this exception:

INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
> org.apache.spark.sql.AnalysisException: Complete output mode not supported
> when there are no streaming aggregations on streaming DataFrames/Datasets;
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:173)
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:65)
> at
> org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:236)
> at
> org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:287)
> at .(:59)
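
For what it's worth, that particular exception is about the output mode rather than
writeStream itself: "complete" mode needs an aggregation, while "append" works for a
plain projection. A minimal sketch (the console sink is just for testing; streamingDF
stands for the streaming DataFrame from the custom source):

val query = streamingDF.writeStream
  .outputMode("append")    // no aggregation, so append instead of complete
  .format("console")
  .start()

query.awaitTermination()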




2016-08-01 18:44 GMT+02:00 Amit Sela :

> I think you're missing:
>
> val query = wordCounts.writeStream
>
>   .outputMode("complete")
>   .format("console")
>   .start()
>
> Dis it help ?
>
> On Mon, Aug 1, 2016 at 2:44 PM Jacek Laskowski  wrote:
>
>> On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali
>>  wrote:
>>
>> > the problem now is that when I consume the dataframe for example with
>> count
>> > I get the stack trace below.
>>
>> Mind sharing the entire pipeline?
>>
>> > I followed the implementation of TextSocketSourceProvider to implement
>> my
>> > data source and Text Socket source is used in the official documentation
>> > here.
>>
>> Right. Completely forgot about the provider. Thanks for reminding me
>> about it!
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Application not showing in Spark History

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Have you tried https://github.com/spark-jobserver/spark-jobserver

On Tue, Aug 2, 2016 at 2:23 PM, Rychnovsky, Dusan
 wrote:
> Hi,
>
>
> I am trying to launch my Spark application from within my Java application
> via the SparkSubmit class, like this:
>
>
>
> List args = new ArrayList<>();
>
> args.add("--verbose");
> args.add("--deploy-mode=cluster");
> args.add("--master=yarn");
> ...
>
>
> SparkSubmit.main(args.toArray(new String[args.size()]));
>
>
>
> This works fine, with one catch - the application does not appear in Spark
> History after it's finished.
>
>
> If, however, I run the application using `spark-submit.sh`, like this:
>
>
>
> spark-submit \
>
>   --verbose \
>
>   --deploy-mode=cluster \
>
>   --master=yarn \
>
>   ...
>
>
>
> the application appears in Spark History correctly.
>
>
> What am I missing?
>
>
> Also, is this a good way to launch a Spark application from within a Java
> application or is there a better way?
>
> Thanks,
>
> Dusan
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Stop Spark Streaming Jobs

2016-08-02 Thread Pradeep
Thanks Park. I am doing the same. I was trying to understand if there are other
ways.
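
One alternative sometimes used for a cleaner stop than killing the YARN application is
to poll for a marker file and stop the StreamingContext gracefully (a sketch only; the
marker path is hypothetical and ssc is the StreamingContext):

import org.apache.hadoop.fs.{FileSystem, Path}

val stopFlag = new Path("/tmp/stop_my_streaming_job")   // hypothetical marker file
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

var stopped = false
while (!stopped) {
  // returns true once the context has terminated
  stopped = ssc.awaitTerminationOrTimeout(10000)
  if (!stopped && fs.exists(stopFlag)) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}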

Thanks,
Pradeep

> On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee  wrote:
> 
> So sorry. Your name was Pradeep !!
> 
> -Original Message-
> From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com] 
> Sent: Wednesday, August 03, 2016 11:24 AM
> To: 'Pradeep'; 'user@spark.apache.org'
> Subject: RE: Stop Spark Streaming Jobs
> 
> Hi, Pradeep
> 
> 
> Did you mean, how to kill the job?
> If yes, you should kill the driver and follow the steps below.
> 
> on yarn-client
> 1. find pid - "ps -es | grep "
> 2. kill it - "kill -9 "
> 3. check executors were down - "yarn application -list"
> 
> on yarn-cluster
> 1. find driver's application ID - "yarn application -list"
> 2. stop it - "yarn application -kill "
> 3. check driver and executors were down - "yarn application -list"
> 
> 
> Thanks.
> 
> -Original Message-
> From: Pradeep [mailto:pradeep.mi...@mail.com] 
> Sent: Wednesday, August 03, 2016 10:48 AM
> To: user@spark.apache.org
> Subject: Stop Spark Streaming Jobs
> 
> Hi All,
> 
> My streaming job reads data from Kafka. The job is triggered and pushed to
> background with nohup.
> 
> What are the recommended ways to stop job either on yarn-client or cluster
> mode.
> 
> Thanks,
> Pradeep
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org