Re: Choice of IDE for Spark

2021-10-01 Thread Holden Karau
Personally I like Jupyter notebooks for my interactive work and then once
I’ve done my exploration, I switch back to Emacs with either scala-metals or
Python mode.

I think the main takeaway is: do what feels best for you, there is no one
true way to develop in Spark.

On Fri, Oct 1, 2021 at 1:28 AM Mich Talebzadeh 
wrote:

> Thanks guys for your comments.
>
> I agree with you, Florian, that opening a terminal, say in VSC, allows you to
> run a shell script (a .sh file) to submit your Spark code. However, this
> really only makes sense if your IDE is running on a Linux host submitting a job
> to a Kubernetes or YARN cluster.
>
> For Python, I will go with PyCharm, which is specific to the Python world.
> With Spark, I have used IntelliJ with the Spark plug-in on a Mac for development
> work. I then created a JAR file, gzipped the whole project and scp'ed it to an
> IBM sandbox, untarred it and ran it with a pre-prepared shell script with
> environment settings for dev, test, staging, etc.
>
> An IDE is also useful for looking at CSV and TSV files or converting JSON
> from one form to another. For JSON validation, especially if the file is too
> large, you may be restricted from loading the file into a web-based JSON
> validator because of the risk of exposing proprietary data. There is a tool
> called jq  (a lightweight and flexible
> command-line JSON processor) that comes in pretty handy for validating JSON.
> Download and install it on your OS and run it as
>
> zcat .tgz | jq
>
> That will validate the whole tarred and gzipped JSON file. Otherwise, most
> of these IDE tools come with add-on plugins for various needs. My
> preference would be to use the best available IDE for the job. VSC I would
> consider a general-purpose tool. If all else fails, one can always use OS
> tools like vi, vim, sed, awk, etc.
>
>
> Cheers
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 1 Oct 2021 at 06:55, Florian CASTELAIN <
> florian.castel...@redlab.io> wrote:
>
>> Hello.
>>
>> Any "evolved" code editor allows you to create tasks (or builds, or
>> whatever they are called in the IDE you choose). If you do not find anything
>> that packages everything you need by default, you can just create your own tasks.
>>
>>
>> *For yarn, one needs to open a terminal and submit from there. *
>>
>> You can create task(s) that launch your yarn commands.
>>
>>
>> *With VSC, you get stuff for working with json files but I am not sure
>> with a plugin for Python *
>>
>> In your json task configuration, you can launch whatever you want:
>> python, shell. I bet you could launch your favorite video game (just make a
>> task called "let's have a break" )
>>
>> Just to say, if you want everything exactly the way you want, I do not
>> think you will find an IDE that does it. You will have to customize it.
>> (correct me if wrong, of course).
>>
>> Have a good day.
>>
>>
>>
>>
>> *Florian CASTELAIN *
>> *Software Engineer*
>>
>> 72 Rue de la République, 76140 Le Petit-Quevilly
>> 
>> m: +33 616 530 226
>> e: florian.castel...@redlab.io w: www.redlab.io
>>
>> --
>> *From:* Jeff Zhang 
>> *Sent:* Thursday, 30 September 2021 13:57
>> *To:* Mich Talebzadeh 
>> *Cc:* user @spark 
>> *Subject:* Re: Choice of IDE for Spark
>>
>> IIRC, you want an IDE for pyspark on yarn ?
>>
>> Mich Talebzadeh wrote on Thursday, 30 September 2021 at 7:00 PM:
>>
>> Hi,
>>
>> This may look like a redundant question, but it comes about because of the
>> advent of cloud workstations like Amazon WorkSpaces and others.
>>
>> With IntelliJ you are OK with Spark and Scala. With PyCharm you are fine
>> with PySpark and the virtual environment. Mind you, as far as I know, PyCharm
>> only executes spark-submit in local mode. For YARN, one needs to open a
>> terminal and submit from there.
>>
>> However, on an Amazon workstation, you get Visual Studio Code
>>  (VSC, an MS product) and OpenOffice
>> installed. With VSC you get tooling for working with JSON files, but I am not
>> sure whether, with a plugin for Python etc., it will be as good as PyCharm.
>> Has anyone used VSC in anger for Spark, and if so, what was the experience?
>>
>> Thanks
>>
>>
>>

Re: Choice of IDE for Spark

2021-10-01 Thread Nicolas Paris
> With IntelliJ you are OK with Spark & Scala.

Also, IntelliJ has a nice Python plugin that turns it into PyCharm.


On Thu Sep 30, 2021 at 1:57 PM CEST, Jeff Zhang wrote:
> IIRC, you want an IDE for pyspark on yarn ?
>
> Mich Talebzadeh wrote on Thursday, 30 September 2021 at 7:00 PM:
>
> > Hi,
> >
> > This may look like a redundant question, but it comes about because of the
> > advent of cloud workstations like Amazon WorkSpaces and others.
> >
> > With IntelliJ you are OK with Spark and Scala. With PyCharm you are fine
> > with PySpark and the virtual environment. Mind you, as far as I know, PyCharm
> > only executes spark-submit in local mode. For YARN, one needs to open a
> > terminal and submit from there.
> >
> > However, on an Amazon workstation, you get Visual Studio Code
> >  (VSC, an MS product) and OpenOffice
> > installed. With VSC you get tooling for working with JSON files, but I am not
> > sure whether, with a plugin for Python etc., it will be as good as PyCharm.
> > Has anyone used VSC in anger for Spark, and if so, what was the experience?
> >
> > Thanks
> >
> >
> >
> >
> >
> >
>
>
> --
> Best Regards
>
> Jeff Zhang


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Trying to hash cross features with mllib

2021-10-01 Thread Sean Owen
Are you looking for
https://spark.apache.org/docs/latest/ml-features.html#interaction ? That's
the closest built-in thing I can think of. Otherwise you can make custom
transformations.

On Fri, Oct 1, 2021, 8:44 AM David Diebold  wrote:

> Hello everyone,
>
> In MLlib, I’m trying to rely essentially on pipelines to create features
> out of the Titanic dataset, and showcase the power of feature hashing. I
> want to:
>
> - Apply bucketization on some columns (QuantileDiscretizer is fine).
>
> - Then I want to cross all my columns with each other to get cross features.
>
> - Then I would like to hash all of these cross features into a vector.
>
> - Then give it to a logistic regression.
>
> Looking at the documentation, it looks like the only way to hash features
> is the *FeatureHasher* transformation. It takes multiple columns as
> input; the type can be numeric, bool, or string (but not vector/array).
>
> But now I’m left wondering how I can create my cross-feature columns. I’m
> looking for a transformation that could take two columns as input and
> return a numeric, bool, or string. I didn't manage to find anything that
> does the job. There are multiple transformations, such as VectorAssembler,
> that operate on vectors, but this is not a type accepted by the FeatureHasher.
>
> Of course, I could try to combine columns directly in my dataframe (before
> the pipeline kicks in), but then I would no longer be able to benefit
> from QuantileDiscretizer and other useful functions.
>
>
> Am I missing something in the transformation API? Or is my approach to
> hashing wrong? Or should we consider extending the API somehow?
>
>
>
> Thank you, kind regards,
>
> David
>


Trying to hash cross features with mllib

2021-10-01 Thread David Diebold
Hello everyone,

In MLlib, I’m trying to rely essentially on pipelines to create features
out of the Titanic dataset, and showcase the power of feature hashing. I
want to:

- Apply bucketization on some columns (QuantileDiscretizer is fine).

- Then I want to cross all my columns with each other to get cross features.

- Then I would like to hash all of these cross features into a vector.

- Then give it to a logistic regression.

Looking at the documentation, it looks like the only way to hash features
is the *FeatureHasher* transformation. It takes multiple columns as input;
the type can be numeric, bool, or string (but not vector/array).

But now I’m left wondering how I can create my cross-feature columns. I’m
looking for a transformation that could take two columns as input and
return a numeric, bool, or string. I didn't manage to find anything that
does the job. There are multiple transformations, such as VectorAssembler,
that operate on vectors, but this is not a type accepted by the FeatureHasher.
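For illustration, the crossing-then-hashing idea can be sketched in plain Python. This mimics what a hashing transformer does conceptually; the bucket count, the token format, and the use of md5 are illustrative choices of mine, not pyspark's actual implementation:

```python
import hashlib

def hash_cross_features(row, num_buckets=16):
    """Hash every pairwise cross feature of a row into a fixed-size vector.

    `row` is a dict of column name -> (already bucketized) value.
    """
    vec = [0.0] * num_buckets
    cols = sorted(row)  # stable ordering so the same row always hashes the same
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            # A cross feature is just the pair of column=value tokens.
            token = f"{a}={row[a]} x {b}={row[b]}"
            # md5 gives a hash that is stable across runs (unlike hash()),
            # reduced modulo the bucket count: the hashing trick.
            idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % num_buckets
            vec[idx] += 1.0
    return vec

passenger = {"sex": "female", "pclass": "1", "age_bucket": "2"}
print(hash_cross_features(passenger))
```

With enough buckets, collisions between distinct cross features become rare; a fixed vector size regardless of cardinality is the trade-off the hashing trick buys.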

Of course, I could try to combine columns directly in my dataframe (before
the pipeline kicks in), but then I would no longer be able to benefit
from QuantileDiscretizer and other useful functions.


Am I missing something in the transformation API? Or is my approach to
hashing wrong? Or should we consider extending the API somehow?



Thank you, kind regards,

David


Re: Choice of IDE for Spark

2021-10-01 Thread Mich Talebzadeh
Thanks guys for your comments.

I agree with you, Florian, that opening a terminal, say in VSC, allows you to
run a shell script (a .sh file) to submit your Spark code. However, this
really only makes sense if your IDE is running on a Linux host submitting a job
to a Kubernetes or YARN cluster.

For Python, I will go with PyCharm, which is specific to the Python world.
With Spark, I have used IntelliJ with the Spark plug-in on a Mac for development
work. I then created a JAR file, gzipped the whole project and scp'ed it to an
IBM sandbox, untarred it and ran it with a pre-prepared shell script with
environment settings for dev, test, staging, etc.

An IDE is also useful for looking at CSV and TSV files or converting JSON from
one form to another. For JSON validation, especially if the file is too
large, you may be restricted from loading the file into a web-based JSON
validator because of the risk of exposing proprietary data. There is a tool
called jq  (a lightweight and flexible
command-line JSON processor) that comes in pretty handy for validating JSON.
Download and install it on your OS and run it as

zcat .tgz | jq

That will validate the whole tarred and gzipped JSON file. Otherwise, most
of these IDE tools come with add-on plugins for various needs. My
preference would be to use the best available IDE for the job. VSC I would
consider a general-purpose tool. If all else fails, one can always use OS
tools like vi, vim, sed, awk, etc.


Cheers






On Fri, 1 Oct 2021 at 06:55, Florian CASTELAIN 
wrote:

> Hello.
>
> Any "evolved" code editor allows you to create tasks (or builds, or
> whatever they are called in the IDE you choose). If you do not find anything
> that packages everything you need by default, you can just create your own tasks.
>
>
> *For yarn, one needs to open a terminal and submit from there. *
>
> You can create task(s) that launch your yarn commands.
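Such a task might look like the following in VS Code's .vscode/tasks.json; a minimal sketch, where the label, script name, and spark-submit options are illustrative, not taken from this thread:

```json
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "spark-submit to YARN",
      "type": "shell",
      "command": "spark-submit --master yarn --deploy-mode cluster app.py",
      "problemMatcher": []
    }
  ]
}
```

Run via Terminal > Run Task... or bound to a keybinding, this saves switching to a separate terminal for each submission.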
>
>
> *With VSC, you get stuff for working with json files but I am not sure
> with a plugin for Python *
>
> In your json task configuration, you can launch whatever you want: python,
> shell. I bet you could launch your favorite video game (just make a task
> called "let's have a break" )
>
> Just to say, if you want everything exactly the way you want, I do not
> think you will find an IDE that does it. You will have to customize it.
> (correct me if wrong, of course).
>
> Have a good day.
>
>
> --
> *From:* Jeff Zhang 
> *Sent:* Thursday, 30 September 2021 13:57
> *To:* Mich Talebzadeh 
> *Cc:* user @spark 
> *Subject:* Re: Choice of IDE for Spark
>
> IIRC, you want an IDE for pyspark on yarn ?
>
> Mich Talebzadeh wrote on Thursday, 30 September 2021 at 7:00 PM:
>
> Hi,
>
> This may look like a redundant question, but it comes about because of the
> advent of cloud workstations like Amazon WorkSpaces and others.
>
> With IntelliJ you are OK with Spark and Scala. With PyCharm you are fine
> with PySpark and the virtual environment. Mind you, as far as I know, PyCharm
> only executes spark-submit in local mode. For YARN, one needs to open a
> terminal and submit from there.
>
> However, on an Amazon workstation, you get Visual Studio Code
>  (VSC, an MS product) and OpenOffice
> installed. With VSC you get tooling for working with JSON files, but I am not
> sure whether, with a plugin for Python etc., it will be as good as PyCharm.
> Has anyone used VSC in anger for Spark, and if so, what was the experience?
>
> Thanks
>
>
>
>
>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>