Re: SparkR : glm model

2016-06-11 Thread Sun Rui
You were looking at some old code.
The poisson family is supported in the latest master branch.
You can try the Spark 2.0 preview release from
http://spark.apache.org/news/spark-2.0.0-preview.html
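
For reference, the SparkR call in the 2.0 preview is roughly glm(y ~ x1 + x2,
data = df, family = "poisson"), which is backed by the GeneralizedLinearRegression
estimator in the Scala ml API. A minimal sketch of that Scala-side API follows;
the toy data, column names and session setup are illustrative only, not taken
from this thread:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

object PoissonGlmSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PoissonGlmSketch").getOrCreate()

    // Toy count data: label is a non-negative count, features is a vector.
    val df = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.0)),
      (2.0, Vectors.dense(1.0, 0.0)),
      (4.0, Vectors.dense(2.0, 2.0)),
      (6.0, Vectors.dense(3.0, 1.0))
    )).toDF("label", "features")

    // Poisson family with the canonical log link.
    val glr = new GeneralizedLinearRegression()
      .setFamily("poisson")
      .setLink("log")
      .setMaxIter(10)

    val model = glr.fit(df)
    println(s"coefficients=${model.coefficients} intercept=${model.intercept}")

    spark.stop()
  }
}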


> On Jun 10, 2016, at 12:14, april_ZMQ  wrote:
> 
> Hi all,
> 
> I'm a student working on a data analysis project with SparkR.
> 
> I found out that GLM (generalized linear model) only supports two types of
> distribution, "gaussian" and "binomial".
> However, our project requires the "poisson" distribution. Meanwhile, I
> found out that SparkR supported "poisson" before, but this function is no
> longer available: https://issues.apache.org/jira/browse/SPARK-12566
>   
> 
> Is there any approach by which I can use the previous official package for
> the poisson distribution in SparkR instead?
> 
> Thank you very much!
> 
> 
> 
> 
> 



Re: Slow collecting of large Spark Data Frames into R

2016-06-11 Thread Sun Rui
Hi, Jonathan,

Thanks for reporting. This is a known issue that the community would like to
address later.

Please refer to https://issues.apache.org/jira/browse/SPARK-14037. It would be
helpful if you could profile your use case using the method discussed in the
JIRA issue and paste the metrics information into it; that would help with
addressing the issue.

> On Jun 11, 2016, at 08:31, Jonathan Mortensen  wrote:
> 
> 16BG



Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mich Talebzadeh
Hi Ashok

Your points:

"
I know I can start spark-shell by launching the shell itself

spark-shell

Now I know that in standalone mode I can also connect to master

spark-shell --master spark://:7077

My point is what are the differences between these two start-up modes for
spark-shell? If I start spark-shell and connect to master what performance
gain will I get if any or it does not matter. Is it the same as for
spark-submit"

When you use spark-shell, or for that matter spark-sql, you are starting
spark-submit under the bonnet. These two shells exist to make it easier to
work with Spark.

However, if you look at what $SPARK_HOME/bin/spark-shell does in the
script, you will see my point:

"${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main
--name "Spark shell" "$@"

So that is basically the spark-submit JVM invoked with the name "Spark shell".

Since it uses spark-submit, it accepts all the parameters related to
spark-submit, as described in the spark-submit documentation.


For example, the default web GUI port for Spark is 4040. However, I start it
on port 5 and modify the invocation to give it a different name:

"${SPARK_HOME}"/bin/spark-submit --conf "spark.ui.port=5" --class
org.apache.spark.repl.Main --name "my own Spark shell" "$@"

In local mode (where you are not starting a master and slaves/workers) the
application will try to grab all the available CPUs (in theory) unless you
restrict it with master local[n]; you can see that in the web GUI under the
Environment tab as spark master local[n]. This mode is pretty simple, and
you can run as many JVMs (spark-submit) as your resources allow. The GUI
starts at 4040, the next one at 4041, and so forth.

The crucial point is that by default Spark will deploy in --master local
mode. You can look at the resource usage through the GUI and also with an
OS tool such as "free".

In standalone cluster mode, where Spark deploys its own scheduling, you run
start-master and start-slaves (which starts the workers) and you end up with
a more distributed system, with a number of worker processes on different
nodes using parallelism to speed up the work. This is in contrast to
"local" mode, where it all happens on the same physical host and your
best hope is to use all the available cores.
Hence, in summary, by using Spark in standalone mode (this terminology is a
bit misleading; it would be better if they called it Spark Own Scheduler
Mode (OSM)) you will get better performance thanks to the clustering nature
of Spark.
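
As a quick check of which mode a given shell ended up in, you can inspect the
SparkContext that spark-shell creates for you (a small sketch to paste into the
shell, not part of the original mail; sc is the shell's pre-built context):

// Paste into spark-shell: sc is the SparkContext the shell has already created.
println(s"master  = ${sc.master}")          // e.g. local[*] or spark://host:7077
println(s"appName = ${sc.appName}")
// spark.ui.port falls back to 4040 (then 4041, ...) when not set explicitly.
println(s"ui.port = ${sc.getConf.get("spark.ui.port", "4040 (default)")}")
println(s"default parallelism = ${sc.defaultParallelism}")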

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 11 June 2016 at 22:38, Gavin Yue  wrote:

> Sorry I have a typo.
>
> Which means spark does not use yarn or mesos in standalone mode...
>
>
>
> On Jun 11, 2016, at 14:35, Mich Talebzadeh 
> wrote:
>
> Hi Gavin,
>
> I believe in standalone mode a simple cluster manager is included with
> Spark that makes it easy to set up a cluster. It does not rely on YARN or
> Mesos.
>
> In summary this is from my notes:
>
>
>-
>
>Spark Local - Spark runs on the local host. This is the simplest set
>up and best suited for learners who want to understand different concepts
>of Spark and those performing unit testing.
>-
>
>Spark Standalone – a simple cluster manager included with Spark that
>makes it easy to set up a cluster.
>-
>
>YARN Cluster Mode, the Spark driver runs inside an application master
>process which is managed by YARN on the cluster, and the client can go away
>after initiating the application.
>-
>
>Mesos. I have not used it so cannot comment
>
> YARN Client Mode, the driver runs in the client process, and the
> application master is only used for requesting resources from YARN. Unlike
> Local or Spark standalone modes, in which the master’s address is specified
> in the --master parameter, in YARN mode the ResourceManager’s address is
> picked up from the Hadoop configuration. Thus, the --master parameter is
> yarn
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 June 2016 at 22:26, Gavin Yue  wrote:
>
>> The standalone mode is against Yarn mode or Mesos mode, which means spark
>> uses Yarn or Mesos as cluster managements.
>>
>> Local mode is actually a standalone mode which everything runs on the
>> single local machine instead of remote clusters.
>>
>> That is my understanding.
>>
>>
>> On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar <
>> ashok34...@yahoo.com.invalid> wrote:
>>
>>> Thank 

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Gavin Yue
Sorry, I had a typo.

What I meant is that Spark does not use YARN or Mesos in standalone mode...



> On Jun 11, 2016, at 14:35, Mich Talebzadeh  wrote:
> 
> Hi Gavin,
> 
> I believe in standalone mode a simple cluster manager is included with Spark 
> that makes it easy to set up a cluster. It does not rely on YARN or Mesos.
> 
> In summary this is from my notes:
> 
> Spark Local - Spark runs on the local host. This is the simplest set up and 
> best suited for learners who want to understand different concepts of Spark 
> and those performing unit testing.
> Spark Standalone – a simple cluster manager included with Spark that makes it 
> easy to set up a cluster.
> YARN Cluster Mode, the Spark driver runs inside an application master process 
> which is managed by YARN on the cluster, and the client can go away after 
> initiating the application.
> Mesos. I have not used it so cannot comment
> YARN Client Mode, the driver runs in the client process, and the application 
> master is only used for requesting resources from YARN. Unlike Local or Spark 
> standalone modes, in which the master’s address is specified in the --master 
> parameter, in YARN mode the ResourceManager’s address is picked up from the 
> Hadoop configuration. Thus, the --master parameter is yarn
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 11 June 2016 at 22:26, Gavin Yue  wrote:
>> The standalone mode is against Yarn mode or Mesos mode, which means spark 
>> uses Yarn or Mesos as cluster managements. 
>> 
>> Local mode is actually a standalone mode which everything runs on the single 
>> local machine instead of remote clusters.
>> 
>> That is my understanding. 
>> 
>> 
>>> On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar 
>>>  wrote:
>>> Thank you for grateful
>>> 
>>> I know I can start spark-shell by launching the shell itself
>>> 
>>> spark-shell 
>>> 
>>> Now I know that in standalone mode I can also connect to master
>>> 
>>> spark-shell --master spark://:7077
>>> 
>>> My point is what are the differences between these two start-up modes for 
>>> spark-shell? If I start spark-shell and connect to master what performance 
>>> gain will I get if any or it does not matter. Is it the same as for 
>>> spark-submit 
>>> 
>>> regards
>>> 
>>> 
>>> On Saturday, 11 June 2016, 19:39, Mohammad Tariq  wrote:
>>> 
>>> 
>>> Hi Ashok,
>>> 
>>> In local mode all the processes run inside a single jvm, whereas in 
>>> standalone mode we have separate master and worker processes running in 
>>> their own jvms.
>>> 
>>> To quickly test your code from within your IDE you could probable use the 
>>> local mode. However, to get a real feel of how Spark operates I would 
>>> suggest you to have a standalone setup as well. It's just the matter of 
>>> launching a standalone cluster either manually(by starting a master and 
>>> workers by hand), or by using the launch scripts provided with Spark 
>>> package. 
>>> 
>>> You can find more on this here.
>>> 
>>> HTH
>>> 
>>>  
>>> 
>>> Tariq, Mohammad
>>> about.me/mti
>>> 
>>> 
>>> 
>>>  
>>> 
>>> On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar 
>>>  wrote:
>>> Hi,
>>> 
>>> What is the difference between running Spark in Local mode or standalone 
>>> mode?
>>> 
>>> Are they the same. If they are not which is best suited for non prod work.
>>> 
>>> I am also aware that one can run Spark in Yarn mode as well.
>>> 
>>> Thanks
> 


Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mich Talebzadeh
Hi Gavin,

I believe in standalone mode a simple cluster manager is included with
Spark that makes it easy to set up a cluster. It does not rely on YARN or
Mesos.

In summary this is from my notes:


   - Spark Local - Spark runs on the local host. This is the simplest set up
     and best suited for learners who want to understand different concepts
     of Spark and those performing unit testing.

   - Spark Standalone – a simple cluster manager included with Spark that
     makes it easy to set up a cluster.

   - YARN Cluster Mode – the Spark driver runs inside an application master
     process which is managed by YARN on the cluster, and the client can go
     away after initiating the application.

   - Mesos – I have not used it so cannot comment.

In YARN Client Mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN. Unlike
Local or Spark standalone modes, in which the master’s address is specified
in the --master parameter, in YARN mode the ResourceManager’s address is
picked up from the Hadoop configuration. Thus, the --master parameter is
yarn.

HTH




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 11 June 2016 at 22:26, Gavin Yue  wrote:

> The standalone mode is against Yarn mode or Mesos mode, which means spark
> uses Yarn or Mesos as cluster managements.
>
> Local mode is actually a standalone mode which everything runs on the
> single local machine instead of remote clusters.
>
> That is my understanding.
>
>
> On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar <
> ashok34...@yahoo.com.invalid> wrote:
>
>> Thank you for grateful
>>
>> I know I can start spark-shell by launching the shell itself
>>
>> spark-shell
>>
>> Now I know that in standalone mode I can also connect to master
>>
>> spark-shell --master spark://:7077
>>
>> My point is what are the differences between these two start-up modes for
>> spark-shell? If I start spark-shell and connect to master what performance
>> gain will I get if any or it does not matter. Is it the same as for 
>> spark-submit
>>
>>
>>
>> regards
>>
>>
>> On Saturday, 11 June 2016, 19:39, Mohammad Tariq 
>> wrote:
>>
>>
>> Hi Ashok,
>>
>> In local mode all the processes run inside a single jvm, whereas in
>> standalone mode we have separate master and worker processes running in
>> their own jvms.
>>
>> To quickly test your code from within your IDE you could probable use the
>> local mode. However, to get a real feel of how Spark operates I would
>> suggest you to have a standalone setup as well. It's just the matter
>> of launching a standalone cluster either manually(by starting a master and
>> workers by hand), or by using the launch scripts provided with Spark
>> package.
>>
>> You can find more on this *here*
>> .
>>
>> HTH
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> 
>>
>>
>> On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar <
>> ashok34...@yahoo.com.invalid> wrote:
>>
>> Hi,
>>
>> What is the difference between running Spark in Local mode or standalone
>> mode?
>>
>> Are they the same. If they are not which is best suited for non prod work.
>>
>> I am also aware that one can run Spark in Yarn mode as well.
>>
>> Thanks
>>
>>
>>
>>
>>
>


Re: Running Spark in Standalone or local modes

2016-06-11 Thread Gavin Yue
The standalone mode is against Yarn mode or Mesos mode, which means spark
uses Yarn or Mesos as cluster managements.

Local mode is actually a standalone mode which everything runs on the
single local machine instead of remote clusters.

That is my understanding.


On Sat, Jun 11, 2016 at 12:40 PM, Ashok Kumar 
wrote:

> Thank you for grateful
>
> I know I can start spark-shell by launching the shell itself
>
> spark-shell
>
> Now I know that in standalone mode I can also connect to master
>
> spark-shell --master spark://:7077
>
> My point is what are the differences between these two start-up modes for
> spark-shell? If I start spark-shell and connect to master what performance
> gain will I get if any or it does not matter. Is it the same as for 
> spark-submit
>
>
>
> regards
>
>
> On Saturday, 11 June 2016, 19:39, Mohammad Tariq 
> wrote:
>
>
> Hi Ashok,
>
> In local mode all the processes run inside a single jvm, whereas in
> standalone mode we have separate master and worker processes running in
> their own jvms.
>
> To quickly test your code from within your IDE you could probable use the
> local mode. However, to get a real feel of how Spark operates I would
> suggest you to have a standalone setup as well. It's just the matter
> of launching a standalone cluster either manually(by starting a master and
> workers by hand), or by using the launch scripts provided with Spark
> package.
>
> You can find more on this *here*
> .
>
> HTH
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> 
>
>
> On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar <
> ashok34...@yahoo.com.invalid> wrote:
>
> Hi,
>
> What is the difference between running Spark in Local mode or standalone
> mode?
>
> Are they the same. If they are not which is best suited for non prod work.
>
> I am also aware that one can run Spark in Yarn mode as well.
>
> Thanks
>
>
>
>
>


Accuracy of BinaryClassificationMetrics

2016-06-11 Thread Marco Mistroni
Hi all,
Which method shall I use to verify the accuracy of a
BinaryClassificationMetrics?
MulticlassMetrics has a precision() method, but that is missing
from BinaryClassificationMetrics.

thanks
 marco
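
No reply appears in this digest, so for reference here is a minimal sketch of
the two MLlib evaluation classes mentioned; the RDD name and the 0.5 threshold
are assumptions, not from the thread. BinaryClassificationMetrics exposes
per-threshold precision/recall and area-under-curve figures rather than a
single precision(); if one overall precision number is wanted, the scores can
be thresholded and fed to MulticlassMetrics:

import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.rdd.RDD

def summarize(scoreAndLabels: RDD[(Double, Double)]): Unit = {
  // Threshold-based and area metrics for a binary classifier.
  val binary = new BinaryClassificationMetrics(scoreAndLabels)
  println(s"area under ROC = ${binary.areaUnderROC()}")
  println(s"area under PR  = ${binary.areaUnderPR()}")
  binary.precisionByThreshold().collect().foreach { case (t, p) =>
    println(s"threshold=$t precision=$p")
  }

  // One overall precision figure: threshold the scores, then reuse MulticlassMetrics.
  val predictionAndLabels = scoreAndLabels.map { case (score, label) =>
    (if (score >= 0.5) 1.0 else 0.0, label)
  }
  println(s"precision = ${new MulticlassMetrics(predictionAndLabels).precision}")
}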


Re: Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
Thank you, I am grateful.
I know I can start spark-shell by launching the shell itself
spark-shell 

Now I know that in standalone mode I can also connect to master
spark-shell --master spark://:7077

My point is: what are the differences between these two start-up modes for
spark-shell? If I start spark-shell and connect to the master, what performance
gain will I get, if any, or does it not matter? Is it the same as for
spark-submit?
regards 

On Saturday, 11 June 2016, 19:39, Mohammad Tariq  wrote:
 

 Hi Ashok,
In local mode all the processes run inside a single jvm, whereas in standalone 
mode we have separate master and worker processes running in their own jvms.
To quickly test your code from within your IDE you could probable use the local 
mode. However, to get a real feel of how Spark operates I would suggest you to 
have a standalone setup as well. It's just the matter of launching a standalone 
cluster either manually(by starting a master and workers by hand), or by using 
the launch scripts provided with Spark package. 
You can find more on this here.
HTH

Tariq, Mohammad
about.me/mti


On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar  
wrote:

Hi,
What is the difference between running Spark in Local mode or standalone mode?
Are they the same. If they are not which is best suited for non prod work.
I am also aware that one can run Spark in Yarn mode as well.
Thanks



  

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Mich Talebzadeh
yes absolutely Ted.

Thanks for highlighting it



Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 11 June 2016 at 19:00, Ted Yu  wrote:

> Another source is the presentation on various ocnferences.
> e.g.
>
> http://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production
>
> FYI
>
> On Sat, Jun 11, 2016 at 8:47 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Interesting.
>>
>> The pace of development in this field is such that practically every
>> single book in Big Data landscape gets out of data before the ink dries on
>> it  :)
>>
>> I concur that they serve as good reference for starters but in my opinion
>> the best way to learn is to start from on-line docs (and these are pretty
>> respectful when it comes to Spark) and progress from there.
>>
>> If you have a certain problem then put to this group and I am sure
>> someone somewhere in this forum has come across it. Also most of these
>> books' authors actively contribute to this mailing list.
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 11 June 2016 at 16:10, Ted Yu  wrote:
>>
>>>
>>> https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib
>>>
>>>
>>> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib
>>>
>>>
>>> https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib
>>>
>>>
>>> On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel  wrote:
>>>

 Hey

 Namaskara~Nalama~Guten Tag~Bonjour

 I am a newbie to Machine Learning (MLIB and other libraries on Spark)

 Which would be the best book to learn up?

 Thanks
 Deepak
--
 Keigu

 Deepak
 73500 12833
 www.simtree.net, dee...@simtree.net
 deic...@gmail.com

 LinkedIn: www.linkedin.com/in/deicool
 Skype: thumsupdeicool
 Google talk: deicool
 Blog: http://loveandfearless.wordpress.com
 Facebook: http://www.facebook.com/deicool

 "Contribute to the world, environment and more :
 http://www.gridrepublic.org
 "

>>>
>>>
>>
>


Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mohammad Tariq
Hi Ashok,

In local mode all the processes run inside a single JVM, whereas in
standalone mode we have separate master and worker processes running in
their own JVMs.

To quickly test your code from within your IDE you could probably use the
local mode. However, to get a real feel of how Spark operates I would
suggest you have a standalone setup as well. It's just a matter of
launching a standalone cluster either manually (by starting a master and
workers by hand) or by using the launch scripts provided with the Spark
package.

You can find more on this here.
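
Not from the original mail, but to make the IDE-testing point concrete: the
same application code can target either mode just by changing the master URL
(the spark://... address below is a placeholder, not a real cluster):

import org.apache.spark.{SparkConf, SparkContext}

object ModeSketch {
  def main(args: Array[String]): Unit = {
    // local[*] keeps everything in this one JVM (handy inside an IDE).
    // Pass e.g. spark://your-master-host:7077 to use a standalone cluster
    // started with the sbin/start-master.sh and sbin/start-slaves.sh scripts.
    val master = if (args.nonEmpty) args(0) else "local[*]"

    val conf = new SparkConf().setAppName("ModeSketch").setMaster(master)
    val sc = new SparkContext(conf)

    val sum = sc.parallelize(1 to 1000).map(_ * 2).sum()
    println(s"master=${sc.master} sum=$sum")

    sc.stop()
  }
}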

HTH




Tariq, Mohammad
about.me/mti



On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar 
wrote:

> Hi,
>
> What is the difference between running Spark in Local mode or standalone
> mode?
>
> Are they the same. If they are not which is best suited for non prod work.
>
> I am also aware that one can run Spark in Yarn mode as well.
>
> Thanks
>


Running Spark in Standalone or local modes

2016-06-11 Thread Ashok Kumar
Hi,
What is the difference between running Spark in local mode and standalone mode?
Are they the same? If they are not, which is best suited for non-prod work?
I am also aware that one can run Spark in Yarn mode as well.
Thanks

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Ted Yu
Another source is the presentations given at various conferences.
e.g.
http://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production

FYI

On Sat, Jun 11, 2016 at 8:47 AM, Mich Talebzadeh 
wrote:

> Interesting.
>
> The pace of development in this field is such that practically every
> single book in Big Data landscape gets out of data before the ink dries on
> it  :)
>
> I concur that they serve as good reference for starters but in my opinion
> the best way to learn is to start from on-line docs (and these are pretty
> respectful when it comes to Spark) and progress from there.
>
> If you have a certain problem then put to this group and I am sure someone
> somewhere in this forum has come across it. Also most of these books'
> authors actively contribute to this mailing list.
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 June 2016 at 16:10, Ted Yu  wrote:
>
>>
>> https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib
>>
>>
>> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib
>>
>>
>> https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib
>>
>>
>> On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel  wrote:
>>
>>>
>>> Hey
>>>
>>> Namaskara~Nalama~Guten Tag~Bonjour
>>>
>>> I am a newbie to Machine Learning (MLIB and other libraries on Spark)
>>>
>>> Which would be the best book to learn up?
>>>
>>> Thanks
>>> Deepak
>>>--
>>> Keigu
>>>
>>> Deepak
>>> 73500 12833
>>> www.simtree.net, dee...@simtree.net
>>> deic...@gmail.com
>>>
>>> LinkedIn: www.linkedin.com/in/deicool
>>> Skype: thumsupdeicool
>>> Google talk: deicool
>>> Blog: http://loveandfearless.wordpress.com
>>> Facebook: http://www.facebook.com/deicool
>>>
>>> "Contribute to the world, environment and more :
>>> http://www.gridrepublic.org
>>> "
>>>
>>
>>
>


Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Mich Talebzadeh
Interesting.

The pace of development in this field is such that practically every single
book in the Big Data landscape gets out of date before the ink dries on it  :)

I concur that they serve as good references for starters, but in my opinion
the best way to learn is to start from the on-line docs (and these are pretty
respectable when it comes to Spark) and progress from there.

If you have a certain problem then put it to this group and I am sure someone
somewhere in this forum has come across it. Also, most of these books'
authors actively contribute to this mailing list.


HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 11 June 2016 at 16:10, Ted Yu  wrote:

>
> https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib
>
>
> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib
>
>
> https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib
>
>
> On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel  wrote:
>
>>
>> Hey
>>
>> Namaskara~Nalama~Guten Tag~Bonjour
>>
>> I am a newbie to Machine Learning (MLIB and other libraries on Spark)
>>
>> Which would be the best book to learn up?
>>
>> Thanks
>> Deepak
>>--
>> Keigu
>>
>> Deepak
>> 73500 12833
>> www.simtree.net, dee...@simtree.net
>> deic...@gmail.com
>>
>> LinkedIn: www.linkedin.com/in/deicool
>> Skype: thumsupdeicool
>> Google talk: deicool
>> Blog: http://loveandfearless.wordpress.com
>> Facebook: http://www.facebook.com/deicool
>>
>> "Contribute to the world, environment and more :
>> http://www.gridrepublic.org
>> "
>>
>
>


Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Ted Yu
https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib

https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib

https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib


On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel  wrote:

>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
> I am a newbie to Machine Learning (MLIB and other libraries on Spark)
>
> Which would be the best book to learn up?
>
> Thanks
> Deepak
>--
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>


Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Deepak Goel
Hey

Namaskara~Nalama~Guten Tag~Bonjour

I am a newbie to Machine Learning (MLlib and other libraries on Spark).

Which would be the best book to learn from?

Thanks
Deepak
   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"


Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-11 Thread Ajay Chander
I tried implementing the same functionality through Scala as well, but no
luck so far. Just wondering if anyone here has tried using Spark SQL to read
a SAS dataset? Thank you

Regards,
Ajay
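
Not part of this thread, but relevant to the symptom described above: Spark
SQL decides how to quote JDBC identifiers per dialect, and an application can
register its own JdbcDialect to change that. A hedged sketch for spark-shell
or driver set-up code; the "jdbc:sas" URL prefix and the choice to leave
identifiers unquoted are assumptions about the SAS driver, not verified:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object SasDialect extends JdbcDialect {
  // Claim URLs that look like the SAS JDBC driver's (assumed prefix).
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sas")

  // The default dialect wraps column names in double quotes; pass them through instead.
  override def quoteIdentifier(colName: String): String = colName
}

// Register before issuing the JDBC read so Spark picks the dialect up.
JdbcDialects.registerDialect(SasDialect)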

On Friday, June 10, 2016, Ajay Chander  wrote:

> Mich, I completely agree with you. I built another Spark SQL application
> which reads data from MySQL and SQL server and writes the data into
> Hive(parquet+snappy format). I have this problem only when I read directly
> from remote SAS system. The interesting part is I am using same driver to
> read data through pure Java app and spark app. It works fine in Java app,
> so I cannot blame SAS driver here. Trying to understand where the problem
> could be. Thanks for sharing this with me.
>
> On Friday, June 10, 2016, Mich Talebzadeh  > wrote:
>
>> I personally use Scala to do something similar. For example here I
>> extract data from an Oracle table and store in ORC table in Hive. This is
>> compiled via sbt as run with SparkSubmit.
>>
>> It is similar to your code but in Scala. Note that I do not enclose my
>> column names in double quotes.
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkConf
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.functions._
>>
>> object ETL_scratchpad_dummy {
>>   def main(args: Array[String]) {
>>   val conf = new SparkConf().
>>setAppName("ETL_scratchpad_dummy").
>>set("spark.driver.allowMultipleContexts", "true")
>>   val sc = new SparkContext(conf)
>>   // Create sqlContext based on HiveContext
>>   val sqlContext = new HiveContext(sc)
>>   import sqlContext.implicits._
>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   println ("\nStarted at"); sqlContext.sql("SELECT
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
>> ").collect.foreach(println)
>>   HiveContext.sql("use oraclehadoop")
>>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>   var _username : String = "scratchpad"
>>   var _password : String = ""
>>
>>   // Get data from Oracle table scratchpad.dummy
>>   val d = HiveContext.load("jdbc",
>>   Map("url" -> _ORACLEserver,
>>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
>> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>   "user" -> _username,
>>   "password" -> _password))
>>
>>d.registerTempTable("tmp")
>>   //
>>   // Need to create and populate target ORC table oraclehadoop.dummy
>>   //
>>   HiveContext.sql("use oraclehadoop")
>>   //
>>   // Drop and create table dummy
>>   //
>>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>>   var sqltext : String = ""
>>   sqltext = """
>>   CREATE TABLE oraclehadoop.dummy (
>>  ID INT
>>, CLUSTERED INT
>>, SCATTERED INT
>>, RANDOMISED INT
>>, RANDOM_STRING VARCHAR(50)
>>, SMALL_VC VARCHAR(10)
>>, PADDING  VARCHAR(10)
>>   )
>>   CLUSTERED BY (ID) INTO 256 BUCKETS
>>   STORED AS ORC
>>   TBLPROPERTIES (
>>   "orc.create.index"="true",
>>   "orc.bloom.filter.columns"="ID",
>>   "orc.bloom.filter.fpp"="0.05",
>>   "orc.compress"="SNAPPY",
>>   "orc.stripe.size"="16777216",
>>   "orc.row.index.stride"="1" )
>>   """
>>HiveContext.sql(sqltext)
>>   //
>>   // Put data in Hive table. Clean up is already done
>>   //
>>   sqltext = """
>>   INSERT INTO TABLE oraclehadoop.dummy
>>   SELECT
>>   ID
>> , CLUSTERED
>> , SCATTERED
>> , RANDOMISED
>> , RANDOM_STRING
>> , SMALL_VC
>> , PADDING
>>   FROM tmp
>>   """
>>HiveContext.sql(sqltext)
>>   println ("\nFinished at"); sqlContext.sql("SELECT
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
>> ").collect.foreach(println)
>>   sys.exit()
>>  }
>> }
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 10 June 2016 at 23:38, Ajay Chander  wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the response. If you look at my programs, I am not writing
>>> my queries to include column names in a pair of "". The driver in my Spark
>>> program is generating such a query with column names in "", which I do not
>>> want. On the other hand, I am using the same driver in my pure Java program
>>> (attached); in that program the same driver generates a proper
>>> SQL query without "".
>>>
>>> Pure Java log:
>>>
>>> 2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>> a.sr_no,a.start_dt,a.end_dt FROM 

Big Data Interview

2016-06-11 Thread Chaturvedi Chola
A good book on interview preparation for big data:

https://notionpress.com/read/big-data-interview-faqs