Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-)

I would recommend checking out
https://issues.apache.org/jira/browse/SPARK-37935 as a starter task.

Refer to https://github.com/apache/spark/pull/41504 and
https://github.com/apache/spark/pull/41455 as example PRs.

You can also add a new sub-task if you find any error messages that need
improvement.

Thanks!

On Mon, Sep 18, 2023 at 9:33 AM Denny Lee  wrote:

> Hi Ram,
>
> We have some good guidance at
> https://spark.apache.org/contributing.html
>
> HTH!
> Denny
>
>
> On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:
>
>>
>>
>>
>> Hello All,
>> Recently, I joined this community and would like to contribute. Is there a
>> guideline or recommendation on tasks that can be picked up by a first-timer,
>> or a starter task?
>>
>> I tried looking at the Stack Overflow tag apache-spark, but couldn't find
>> any information for first-time contributors.
>>
>> Looking forward to learning and contributing.
>>
>> Thanks
>> Ram
>>
>


Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram,

We have some good guidance at
https://spark.apache.org/contributing.html

HTH!
Denny


On Sun, Sep 17, 2023 at 17:18 ram manickam  wrote:

>
>
>
> Hello All,
> Recently, I joined this community and would like to contribute. Is there a
> guideline or recommendation on tasks that can be picked up by a first-timer,
> or a starter task?
>
> I tried looking at the Stack Overflow tag apache-spark, but couldn't find
> any information for first-time contributors.
>
> Looking forward to learning and contributing.
>
> Thanks
> Ram
>


Re: About contribution

2022-01-06 Thread Dennis Jung
Oh, yes, I'll also check on that.
I just want to know if there's a bit more detail about contributing, because
beyond contributing itself, I also want to understand the Spark project more
deeply.

- To review the code base, what is a good starting point?
- A recommended blog post or document would be great.
- Is there a roadmap for the project?

Thanks.



On Thu, Jan 6, 2022 at 12:24 AM, Sean Owen wrote:

> (There is no project chat)
> See https://spark.apache.org/contributing.html
>
> On Tue, Jan 4, 2022 at 11:42 PM Dennis Jung  wrote:
>
>> Hello, I hope this is not a silly question.
>> (I couldn't find any chat room for the Spark project, so I'm asking on the
>> mailing list.)
>>
>> I have been using Spark at work for about a year, and I would like to make a
>> contribution to this project.
>>
>> I'm currently looking at the documents in more detail, and checking the
>> issues in JIRA. Are there any suggestions for reviewing the code?
>>
>> - Which part of the code would be good to start with?
>> - What would be most helpful for the project?
>>
>> Thanks.
>>
>


Re: About contribution

2022-01-05 Thread Sean Owen
(There is no project chat)
See https://spark.apache.org/contributing.html

On Tue, Jan 4, 2022 at 11:42 PM Dennis Jung  wrote:

> Hello, I hope this is not a silly question.
> (I couldn't find any chat room for the Spark project, so I'm asking on the
> mailing list.)
>
> I have been using Spark at work for about a year, and I would like to make a
> contribution to this project.
>
> I'm currently looking at the documents in more detail, and checking the
> issues in JIRA. Are there any suggestions for reviewing the code?
>
> - Which part of the code would be good to start with?
> - What would be most helpful for the project?
>
> Thanks.
>


About contribution

2022-01-04 Thread Dennis Jung
Hello, I hope this is not a silly question.
(I couldn't find any chat room for the Spark project, so I'm asking on the
mailing list.)

I have been using Spark at work for about a year, and I would like to make a
contribution to this project.

I'm currently looking at the documents in more detail, and checking the
issues in JIRA. Are there any suggestions for reviewing the code?

- Which part of the code would be good to start with?
- What would be most helpful for the project?

Thanks.


Re: Apache Training contribution for Spark - Feedback welcome

2019-07-30 Thread Lars Francke
On Mon, Jul 29, 2019 at 2:46 PM Sean Owen  wrote:

> TL;DR is: take the below as feedback to consider, and proceed as you
> see fit. Nobody's suggesting you can't do this.
>
> On Mon, Jul 29, 2019 at 2:58 AM Lars Francke 
> wrote:
> > The way I read your point is that anyone can publish material (which
> > includes source code) under the ALv2 outside of the ASF so why should they
> > donate anything to the ASF?
> > If that's what you meant, why have Apache Spark or any other Apache
> > project, for that matter?
> >> I think your premise is that people will _collaborate_ on training
> >> materials if there's an ASF project around it. Maybe so but see below.
> > That's our hope, yes. Should we not do this because it _could_ fail?
>
> Yep this is the answer to your question. The ASF exists to facilitate
> collaboration, not just host. I think the dynamics around
> collaboration on open standard software vs training materials are
> materially different.
>

I don't see a big difference between the two things.
Content is already being collaborated on today (see documentation, websites
and the few instances of training that exist or Wikipedia for that matter).
I'm afraid we'll need to agree to disagree on this one.


> > We - as a company - have created material and sold it for years but
> every time I give a training I see something that I should have updated and
> it's become impossible to keep up. I see the same outdated material from
> other organizations, we've talked to half a dozen or so training companies
> and they all have the same problem. To create quality training material you
> really need someone with deep insider knowledge, and those people are hard
> to come by.
> > So we're trying to shift and collaborate on the material and then
> differentiate ourselves by the trainer itself.
>
> I think this hand-waves past a lot of the concern raised here, but OK
> it's an experiment.
> I don't think it's 'wrong' to try to get people to collaborate on
> slides, sure. It may work well. If it doesn't for reasons raised here,
> well, worse things have happened.
> Consider how you might mitigate possible problems:
> a) what happens when another company wants to donate its Spark content?
>

This has been decided at the ASF level already (allow competing projects,
e.g. Flink & Spark). At the Apache Training level we briefly talked about
that as well. I don't want to go into details of the process but the short
version is: We'd accept anything and would then try to incorporate it into
existing stuff.

> b) can you enshrine some best practices like making sure the content
> disclaims official association with the ASF? e.g. a trainer delivering
> it has to note the source but make clear it's not Apache training,
>

Yes.


> etc.
>


Re: Apache Training contribution for Spark - Feedback welcome

2019-07-29 Thread Sean Owen
TL;DR is: take the below as feedback to consider, and proceed as you
see fit. Nobody's suggesting you can't do this.

On Mon, Jul 29, 2019 at 2:58 AM Lars Francke  wrote:
> The way I read your point is that anyone can publish material (which includes 
> source code) under the ALv2 outside of the ASF so why should they donate 
> anything to the ASF?
> If that's what you meant, why have Apache Spark or any other Apache project
> for that matter?
>> I think your premise is that people will _collaborate_ on training
>> materials if there's an ASF project around it. Maybe so but see below.
> That's our hope, yes. Should we not do this because it _could_ fail?

Yep this is the answer to your question. The ASF exists to facilitate
collaboration, not just host. I think the dynamics around
collaboration on open standard software vs training materials are
materially different.

> We - as a company - have created material and sold it for years but every 
> time I give a training I see something that I should have updated and it's 
> become impossible to keep up. I see the same outdated material from other 
> organizations, we've talked to half a dozen or so training companies and they 
> all have the same problem. To create quality training material you really 
> need someone with deep insider knowledge, and those people are hard to come 
> by.
> So we're trying to shift and collaborate on the material and then 
> differentiate ourselves by the trainer itself.

I think this hand-waves past a lot of the concern raised here, but OK
it's an experiment.
I don't think it's 'wrong' to try to get people to collaborate on
slides, sure. It may work well. If it doesn't for reasons raised here,
well, worse things have happened.
Consider how you might mitigate possible problems:
a) what happens when another company wants to donate its Spark content?
b) can you enshrine some best practices like making sure the content
disclaims official association with the ASF? e.g. a trainer delivering
it has to note the source but make clear it's not Apache training,
etc.




Re: Apache Training contribution for Spark - Feedback welcome

2019-07-29 Thread Lars Francke
Happy to discuss this here but you're also invited to bring those points up
at dev@training as other projects might have similar concerns.

The request for assistance still stands. If anyone here is interested in
helping out reviewing and improving the material please reach out.


On Sat, Jul 27, 2019 at 12:01 AM Sean Owen  wrote:

> On Fri, Jul 26, 2019 at 4:01 PM Lars Francke 
> wrote:
> > I understand why it might be seen that way and we need to make sure to
> point out that we have no intention of becoming "The official Apache Spark
> training" because that's not our intention at all.
>
> Of course that's the intention; the problem is perception, and I think
> that's a real problem no matter the intention.
>

Agreed. But that won't stop us from accepting or publishing content. If
that were a dealbreaker then we could move the Training project to the
Attic now.
Along with Livy, Toree, Phoenix, Hivemall, and probably dozens of other ASF
projects which provide things on top of other ASF projects.
None of those are endorsed as "The official X for Y".


> > In this case, however, a company decided to donate their internal
> material - they didn't create this from scratch for the Apache Training
> project.
> > We want to encourage contributions and just because someone else has
> already created material shouldn't stop us from accepting this.
>
> This much doesn't seem like a compelling motive. Anyone can already
> donate their materials to the public domain or publish under the ALv2.
> The existence of an Apache project around it doesn't do anything...
> except your point below maybe:
>
>
> > Every company creates its own material as an asset to sell. There's very
> little quality open-source material out there.
>
> (Except the example I already gave, among many others! There's a lot
> of free content)
>

The way I read your point is that anyone can publish material (which
includes source code) under the ALv2 outside of the ASF so why should they
donate anything to the ASF?
If that's what you meant, why have Apache Spark or any other Apache project
for that matter?

But I don't think that's what you're trying to say.
Hence I believe I must be misunderstanding, and would ask you to
rephrase/reiterate your point, please.


> > We did some research around training and especially open-source training
> before we started the initiative and there are some projects out there that
> do this but all we found were silos with a relatively narrow focus and no
> greater community.
>
> I think your premise is that people will _collaborate_ on training
> materials if there's an ASF project around it. Maybe so but see below.
>

That's our hope, yes. Should we not do this because it _could_ fail?


> > Regarding your "outlines" comment: No, this is the "final" material
> (pending review of course). With "Training" we mean training in the sense
> that Cloudera, Databricks et al. sell as well where an instructor-led
> course is being given using slides. These slides can, but don't have to
> speak for themselves. We're fine with the requirement that an experienced
> instructor needs to give this training. But this is just this content.
> We're also happy to accept other forms of content that are meant for a
> different way of consumption (self-serve). We don't intend to write
> exhaustive or authoritative documentation for projects.
>
> Are we talking about the content attached at TRAINING-17? It doesn't
> look nearly complete or comprehensive enough to endorse as Spark
> training material, IMHO. Again compare to even Jacek's site and
> content for an example of what I think that would look like. It's
> orders of magnitude more complete. I speak for myself, but I would not
> want to endorse that as Spark training with my Apache hat.
>
> I know the premise is, I think, these are _slides_ that trainers can
> deliver, but by themselves there is not enough content for trainers to
> know what to train.
>

No one wants to endorse anything as "official" anything.
And yes: this material is not perfect, but that's how open source works,
isn't it?
This is an initial patch which can be collaborated on and improved.
This is how Spark works too; otherwise it would have been perfect from version
0.1.

Again: I agree Jacek's material is more complete and we could reach out to
him (assuming he reads this anyway) but the fact is that this company did
so first and I want to encourage contributions.

All we're asking for here is help from the Spark community in making our
content better hoping that someone is interested. If not we'll do the best
we can ourselves. But this is where the experts are.


> What is the need this solves -- is there really demand for 'open
> source' training materials? My experience is that training is by
> definition professional services, and has to be delivered by people as
> a for-pay business, and they need to differentiate on the quality they
> provide. It's just materially different from having open standard software.

Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Sean Owen
On Fri, Jul 26, 2019 at 4:01 PM Lars Francke  wrote:
> I understand why it might be seen that way and we need to make sure to point 
> out that we have no intention of becoming "The official Apache Spark 
> training" because that's not our intention at all.

Of course that's the intention; the problem is perception, and I think
that's a real problem no matter the intention.


> In this case, however, a company decided to donate their internal material - 
> they didn't create this from scratch for the Apache Training project.
> We want to encourage contributions and just because someone else has already 
> created material shouldn't stop us from accepting this.

This much doesn't seem like a compelling motive. Anyone can already
donate their materials to the public domain or publish under the ALv2.
The existence of an Apache project around it doesn't do anything...
except your point below maybe:


> Every company creates its own material as an asset to sell. There's very 
> little quality open-source material out there.

(Except the example I already gave, among many others! There's a lot
of free content)


> We did some research around training and especially open-source training 
> before we started the initiative and there are some projects out there that 
> do this but all we found were silos with a relatively narrow focus and no 
> greater community.

I think your premise is that people will _collaborate_ on training
materials if there's an ASF project around it. Maybe so but see below.


> Regarding your "outlines" comment: No, this is the "final" material (pending 
> review of course). With "Training" we mean training in the sense that 
> Cloudera, Databricks et al. sell as well where an instructor-led course is
> being given using slides. These slides can, but don't have to speak for 
> themselves. We're fine with the requirement that an experienced instructor 
> needs to give this training. But this is just this content. We're also happy 
> to accept other forms of content that are meant for a different way of 
> consumption (self-serve). We don't intend to write exhaustive or 
> authoritative documentation for projects.

Are we talking about the content attached at TRAINING-17? It doesn't
look nearly complete or comprehensive enough to endorse as Spark
training material, IMHO. Again compare to even Jacek's site and
content for an example of what I think that would look like. It's
orders of magnitude more complete. I speak for myself, but I would not
want to endorse that as Spark training with my Apache hat.

I know the premise is, I think, these are _slides_ that trainers can
deliver, but by themselves there is not enough content for trainers to
know what to train.

What is the need this solves -- is there really demand for 'open
source' training materials? My experience is that training is by
definition professional services, and has to be delivered by people as
a for-pay business, and they need to differentiate on the quality they
provide. It's just materially different from having open standard
software.




Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Lars Francke
Sean,

thanks for taking the time to comment.

We've discussed those issues during the proposal stage for the Incubator as
others brought them up as well. I can't remember all the details but let me
go through your points inline.

> My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out
> there.


I understand why it might be seen that way and we need to make sure to
point out that we have no intention of becoming "The official Apache Spark
training" because that's not our intention at all.


> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> In comparison the slides I see proposed so far only seem like
> outlines?
>

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material
- they didn't create this from scratch for the Apache Training project.
We want to encourage contributions and just because someone else has
already created material shouldn't stop us from accepting this.

The opposite in fact: There's very little collaboration - in general -
around training material.
Every company creates its own material as an asset to sell. There's very
little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I
wouldn't be surprised if it goes into the hundreds. And everyone draws the
same or very similar slides (what's an RDD, what's a DataFrame, etc.).
We hope to change that and this contribution can be a first start.

We did some research around training and especially open-source training
before we started the initiative and there are some projects out there that
do this but all we found were silos with a relatively narrow focus and no
greater community.

Regarding your "outlines" comment: No, this is the "final" material
(pending review of course). With "Training" we mean training in the sense
that Cloudera, Databricks et al. sell as well where an instructor-led
course is being given using slides. These slides can, but don't have to
speak for themselves. We're fine with the requirement that an experienced
instructor needs to give this training. But this is just this content.
We're also happy to accept other forms of content that are meant for a
different way of consumption (self-serve). We don't intend to write
exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and
updating) hundreds of slides.

> It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.
>

Definitely. Outdated information is always a danger and I have no guarantee
that this isn't going to happen here.
The fact that this is hosted and governed by the ASF makes it less likely
to be completely abandoned though as there are clear processes in place for
collaboration that don't depend on a single person (which might be the case
with some of the other things that already exist).
We also hope that communities - like Spark - are also interested in
collaborating and while patches are always welcome so is creating a Jira to
point out outdated information.


> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?
>

Some points come to mind; this list is neither exhaustive nor do all points
apply equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started, but because we already have four or five
different contributions, we can share one technology stack between all of
them, making it easier to collaborate ("everything looks familiar"), and
every piece of content benefits from improvements in the technical stack
- We hope to have non-tool focused sessions later as well (e.g. Ingesting
data from Kafka into Elasticsearch using Spark [okay, this would maybe be a
bit too specific for now but something along the lines of a "Data
Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the incubator to find
more but I hope this helps a bit?

Cheers,
Lars


>
>
> On Fri, Jul 26, 2019 at 9:00 AM Lars Francke 
> wrote:
> >
> > Hi Spark community,

Re: Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Sean Owen
Generally speaking, I think we want to encourage more training and
tutorial content out there, for sure, so the more the merrier.

My reservation here is that as an Apache project, it might appear to
'bless' one set of materials as authoritative over all the others out
there. And there are already lots of good ones. For example, Jacek has
long maintained a very comprehensive set of free Spark training
materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
In comparison the slides I see proposed so far only seem like
outlines?

It's also a separate project from Spark. We might have trouble
ensuring the info is maintained and up to date, and sometimes outdated
or incorrect info is worse than none - especially if it appears quasi
official. The Spark project already maintains and updates its docs
(which can always be better), so already has its hands full there.

Personally, no strong objection here, but, what's the upside to
running this as an ASF project vs just letting people continue to
publish quality tutorials online?



On Fri, Jul 26, 2019 at 9:00 AM Lars Francke  wrote:
>
> Hi Spark community,
>
> you may or may not have heard of a new-ish (February 2019) project at Apache: 
> Apache Training (incubating). We aim to develop training material about 
> various projects inside and outside the ASF: <http://training.apache.org/>
>
> One of our users wants to contribute material on Spark[1]
>
> We've done something similar for ZooKeeper[2] in the past and the ZooKeeper
> community provided excellent feedback which helped make the product much 
> better[3].
>
> That's why I'd like to invite everyone here to provide any kind of feedback 
> on the content donation. It is currently in PowerPoint format which makes it 
> a bit harder to review so we're happy to accept feedback in any form.
>
> The idea is to convert the material to AsciiDoc at some point.
>
> Cheers,
> Lars
>
> (I didn't want to cross post to user@ as well but this is obviously not 
> limited to dev@ users)
>
> [1] 
> [2] 
> [3] You can see the content here 
> 




Apache Training contribution for Spark - Feedback welcome

2019-07-26 Thread Lars Francke
Hi Spark community,

you may or may not have heard of a new-ish (February 2019) project at
Apache: Apache Training (incubating). We aim to develop training material
about various projects inside and outside the ASF: <
http://training.apache.org/>

One of our users wants to contribute material on Spark[1]

We've done something similar for ZooKeeper[2] in the past and the ZooKeeper
community provided excellent feedback which helped make the product much
better[3].

That's why I'd like to invite everyone here to provide any kind of feedback
on the content donation. It is currently in PowerPoint format which makes
it a bit harder to review so we're happy to accept feedback in any form.

The idea is to convert the material to AsciiDoc at some point.

Cheers,
Lars

(I didn't want to cross post to user@ as well but this is obviously not
limited to dev@ users)

[1] <https://issues.apache.org/jira/browse/TRAINING-17>
[2]
[3] You can see the content here
<https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>


Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Hyukjin Kwon
It's alright - thanks for that.
Anyone can take a look. This is an open source project :D.

On Tue, Jul 9, 2019 at 8:18 PM, Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:

> I can try one and see how it goes, although not familiar with the area.
>
> Stavros
>
> On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am currently aiming to improve the Python UDF, Pandas UDF and Scala UDF
>> test cases by integrating our existing *.sql files at
>> https://issues.apache.org/jira/browse/SPARK-27921
>>
>> I would appreciate it if anyone who's interested in contributing to Spark
>> takes some sub-tasks. There are too many for me to do alone :-). I am doing
>> them one by one for now.
>>
>> I wrote some guides for this umbrella JIRA specifically, so if you're
>> able to follow them very closely, one by one, I think the process itself
>> isn't that difficult.
>>
>> The most important guide, which should be carefully addressed, is:
>> > 7. If there are diffs, analyze them, file or find the JIRA, and skip the
>> tests with comments.
>>
>> Thanks!
>>
>
>
>


Re: Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-09 Thread Stavros Kontopoulos
I can try one and see how it goes, although not familiar with the area.

Stavros

On Tue, Jul 9, 2019 at 6:17 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I am currently aiming to improve the Python UDF, Pandas UDF and Scala UDF
> test cases by integrating our existing *.sql files at
> https://issues.apache.org/jira/browse/SPARK-27921
>
> I would appreciate it if anyone who's interested in contributing to Spark
> takes some sub-tasks. There are too many for me to do alone :-). I am doing
> them one by one for now.
>
> I wrote some guides for this umbrella JIRA specifically, so if you're
> able to follow them very closely, one by one, I think the process itself
> isn't that difficult.
>
> The most important guide, which should be carefully addressed, is:
> > 7. If there are diffs, analyze them, file or find the JIRA, and skip the
> tests with comments.
>
> Thanks!
>


Contribution help needed for sub-tasks of an umbrella JIRA - port *.sql tests to improve coverage of Python, Pandas, Scala UDF cases

2019-07-08 Thread Hyukjin Kwon
Hi all,

I am currently aiming to improve the Python UDF, Pandas UDF and Scala UDF test
cases by integrating our existing *.sql files at
https://issues.apache.org/jira/browse/SPARK-27921

I would appreciate it if anyone who's interested in contributing to Spark takes
some sub-tasks. There are too many for me to do alone :-). I am doing them one
by one for now.

I wrote some guides for this umbrella JIRA specifically, so if you're able
to follow them very closely, one by one, I think the process itself isn't that
difficult.

The most important guide, which should be carefully addressed, is:
> 7. If there are diffs, analyze them, file or find the JIRA, and skip the
tests with comments.

Thanks!
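
For context, the porting pattern behind this umbrella JIRA is to take a query
from an existing *.sql test file and re-run it with its columns wrapped in a
UDF, so that the Python, Pandas and Scala UDF evaluation paths get the same
coverage as the plain SQL path. Below is a minimal Scala sketch of the idea,
assuming a hypothetical temp view `t` and a pass-through `udf` registration;
it is an illustration, not the actual test harness Spark uses.

import org.apache.spark.sql.SparkSession

object UdfSqlPortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("udf-sql-port").getOrCreate()
    import spark.implicits._

    // Stand-in for a table referenced by an existing *.sql test file.
    Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("t")

    // Register a pass-through (identity) UDF: query results should be
    // unchanged, but the UDF evaluation code path is now exercised.
    spark.udf.register("udf", (x: Int) => x)

    // Original test query vs. the ported variant wrapping the column.
    val expected = spark.sql("SELECT id FROM t ORDER BY id").collect()
    val actual   = spark.sql("SELECT udf(id) FROM t ORDER BY id").collect()

    // Step 7 of the guide: if the results differ, analyze the diff, file or
    // find the JIRA, and skip the test with a comment.
    assert(expected.map(_.getInt(0)).sameElements(actual.map(_.getInt(0))))

    spark.stop()
  }
}

A Python or Pandas UDF variant would substitute a Python-side identity
function for the registered Scala one.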


Re: Contribution

2019-02-12 Thread Valeria Vasylieva
Hi Gabor,

Ok, sure I will!

Best regards,

Valeria

On Tue, Feb 12, 2019 at 5:00 PM, Gabor Somogyi wrote:

> Hi Valeria,
>
> Welcome, ping me if you need review.
>
> BR,
> G
>
>
> On Tue, Feb 12, 2019 at 2:51 PM Valeria Vasylieva <
> valeria.vasyli...@gmail.com> wrote:
>
>> Hi Gabor,
>>
>> Thank you for clarification! Will do it!
>> I am happy to join the community!
>>
>> Best Regards,
>> Valeria
>>
>> On Tue, Feb 12, 2019 at 4:32 PM, Gabor Somogyi wrote:
>>
>>> Hi Valeria,
>>>
>>> Glad to hear you would like to contribute! It will be assigned to you
>>> when you create a PR.
>>> Before you create it, please read the following guide, which describes the
>>> details: https://spark.apache.org/contributing.html
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
>>> valeria.vasyli...@gmail.com> wrote:
>>>
 Hi!

 My name is Valeria Vasylieva and I would like to help with the task:
 https://issues.apache.org/jira/browse/SPARK-20597

 Please assign it to me, my JIRA account is:
 nimfadora (
 https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)

 Thank you!

>>>


Re: Contribution

2019-02-12 Thread Gabor Somogyi
Hi Valeria,

Welcome, ping me if you need review.

BR,
G


On Tue, Feb 12, 2019 at 2:51 PM Valeria Vasylieva <
valeria.vasyli...@gmail.com> wrote:

> Hi Gabor,
>
> Thank you for clarification! Will do it!
> I am happy to join the community!
>
> Best Regards,
> Valeria
>
> On Tue, Feb 12, 2019 at 4:32 PM, Gabor Somogyi wrote:
>
>> Hi Valeria,
>>
>> Glad to hear you would like to contribute! It will be assigned to you
>> when you create a PR.
>> Before you create it, please read the following guide, which describes the
>> details: https://spark.apache.org/contributing.html
>>
>> BR,
>> G
>>
>>
>> On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
>> valeria.vasyli...@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> My name is Valeria Vasylieva and I would like to help with the task:
>>> https://issues.apache.org/jira/browse/SPARK-20597
>>>
>>> Please assign it to me, my JIRA account is:
>>> nimfadora (
>>> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)
>>>
>>> Thank you!
>>>
>>


Re: Contribution

2019-02-12 Thread Valeria Vasylieva
Hi Gabor,

Thank you for clarification! Will do it!
I am happy to join the community!

Best Regards,
Valeria

On Tue, Feb 12, 2019 at 4:32 PM, Gabor Somogyi wrote:

> Hi Valeria,
>
> Glad to hear you would like to contribute! It will be assigned to you when
> you create a PR.
> Before you create it, please read the following guide, which describes the
> details: https://spark.apache.org/contributing.html
>
> BR,
> G
>
>
> On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
> valeria.vasyli...@gmail.com> wrote:
>
>> Hi!
>>
>> My name is Valeria Vasylieva and I would like to help with the task:
>> https://issues.apache.org/jira/browse/SPARK-20597
>>
>> Please assign it to me, my JIRA account is:
>> nimfadora (
>> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)
>>
>> Thank you!
>>
>


Re: Contribution

2019-02-12 Thread Gabor Somogyi
Hi Valeria,

Glad to hear you would like to contribute! It will be assigned to you when
you create a PR.
Before you create it, please read the following guide, which describes the
details: https://spark.apache.org/contributing.html

BR,
G


On Tue, Feb 12, 2019 at 2:28 PM Valeria Vasylieva <
valeria.vasyli...@gmail.com> wrote:

> Hi!
>
> My name is Valeria Vasylieva and I would like to help with the task:
> https://issues.apache.org/jira/browse/SPARK-20597
>
> Please assign it to me, my JIRA account is:
> nimfadora (
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)
>
> Thank you!
>


Contribution

2019-02-12 Thread Valeria Vasylieva
Hi!

My name is Valeria Vasylieva and I would like to help with the task:
https://issues.apache.org/jira/browse/SPARK-20597

Please assign it to me, my JIRA account is:
nimfadora (
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nimfadora)

Thank you!


Re: New to dev community | Contribution to Mlib

2017-09-22 Thread Driesprong, Fokko
Hi Venna,

Sounds like a very interesting algorithm. I have to agree with Seth: in the
end you don't want to add a lot of algorithms to Spark itself; it will blow
up the codebase, and in the end the tests will run forever. You can also
consider publishing it on the Spark Packages website. I've also published
an outlier detection algorithm there:
https://spark-packages.org/package/Fokko/spark-stochastic-outlier-selection

Cheers, Fokko

2017-09-22 2:10 GMT+02:00 Venali Sonone :

> Thank you for your response.
>
> The algorithm that I am proposing is Isolation Forest.
> Link to paper: paper. I
> particularly think that it should be included in Spark ML because so many
> applications that use Spark as part of a real-time streaming engine in
> industry need anomaly detection, and current Spark ML supports it in some
> way by means of clustering. I will probably start to create the
> implementation and prepare a proposal, as you suggested.
>
> It is interesting to know that Spark is still implementing stuff in Spark
> ML to reach full parity with MLlib. Can I please get connected to the folks
> working on it, as I am interested in contributing? I have been a heavy user
> of Spark since summer '15.
>
>  Cheers!
> -Venali
>
> On Thu, Sep 21, 2017 at 1:33 AM, Seth Hendrickson <
> seth.hendrickso...@gmail.com> wrote:
>
>> I'm not exactly clear on what you're proposing, but this sounds like
>> something that would live as a Spark package - a framework for anomaly
>> detection built on Spark. If there is some specific algorithm you have in
>> mind, it would be good to propose it on JIRA and discuss why you think it
>> needs to be included in Spark and not live as a Spark package.
>>
>> In general, there will probably be resistance to including new algorithms
>> in Spark ML, especially until the ML package has reached full parity with
>> MLlib. Still, if you can provide more details, that will help us understand
>> what is best here.
>>
>> On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone 
>> wrote:
>>
>>>
>>> Hello,
>>>
>>> I am new to the Spark dev community and also to open source in general,
>>> but have used Spark extensively.
>>> I want to create a complete module for anomaly detection in Spark MLlib.
>>> To that end, I want to know if someone could guide me so I can start the
>>> development and contribute to Spark MLlib.
>>>
>>> Sorry if I sound naive, but any help is appreciated.
>>>
>>> Cheers!
>>> -venna
>>>
>>>
>>
>


Re: New to dev community | Contribution to Mlib

2017-09-21 Thread Venali Sonone
Thank you for your response.

The algorithm that I am proposing is Isolation Forest.
Link to paper: paper. I
particularly think that it should be included in Spark ML because so many
applications that use Spark as part of a real-time streaming engine in
industry need anomaly detection, and current Spark ML supports it in some
way by means of clustering. I will probably start to create the
implementation and prepare a proposal, as you suggested.

It is interesting to know that Spark is still implementing stuff in Spark
ML to reach full parity with MLlib. Can I please get connected to the folks
working on it, as I am interested in contributing? I have been a heavy user
of Spark since summer '15.

 Cheers!
-Venali

On Thu, Sep 21, 2017 at 1:33 AM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> I'm not exactly clear on what you're proposing, but this sounds like
> something that would live as a Spark package - a framework for anomaly
> detection built on Spark. If there is some specific algorithm you have in
> mind, it would be good to propose it on JIRA and discuss why you think it
> needs to be included in Spark and not live as a Spark package.
>
> In general, there will probably be resistance to including new algorithms
> in Spark ML, especially until the ML package has reached full parity with
> MLlib. Still, if you can provide more details, that will help us understand
> what is best here.
>
> On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone 
> wrote:
>
>>
>> Hello,
>>
>> I am new to the Spark dev community and also to open source in general,
>> but have used Spark extensively.
>> I want to create a complete module for anomaly detection in Spark MLlib.
>> To that end, I want to know if someone could guide me so I can start the
>> development and contribute to Spark MLlib.
>>
>> Sorry if I sound naive, but any help is appreciated.
>>
>> Cheers!
>> -venna
>>
>>
>
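
As background on the algorithm being proposed: Isolation Forest scores a
point by its average path length E(h(x)) over a forest of randomly built
trees; shorter paths mean the point was easier to isolate and is therefore
more anomalous. A small Scala sketch of just the scoring arithmetic from the
Liu, Ting and Zhou paper follows; tree construction is omitted, and the
sample values in main are arbitrary illustrations.

object IsolationForestScore {
  // Harmonic number approximation: H(i) ~ ln(i) + Euler-Mascheroni constant.
  private def harmonic(i: Double): Double = math.log(i) + 0.5772156649

  // c(n): the average path length of an unsuccessful BST search over n
  // points, used to normalize observed path lengths.
  def c(n: Long): Double =
    if (n <= 1) 0.0 else 2.0 * harmonic(n - 1.0) - 2.0 * (n - 1.0) / n

  // Anomaly score s(x, n) = 2^(-E(h(x)) / c(n)): values close to 1 suggest
  // an anomaly, values well below 0.5 a normal point.
  def score(avgPathLength: Double, sampleSize: Long): Double =
    math.pow(2.0, -avgPathLength / c(sampleSize))

  def main(args: Array[String]): Unit = {
    // A point isolated quickly (short average path) over trees built on
    // sub-samples of 256 points scores high; an average-length path ~0.5.
    println(f"short path:   ${score(4.0, 256)}%.3f")
    println(f"average path: ${score(c(256), 256)}%.3f")
  }
}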


Re: New to dev community | Contribution to Mlib

2017-09-20 Thread Seth Hendrickson
I'm not exactly clear on what you're proposing, but this sounds like
something that would live as a Spark package - a framework for anomaly
detection built on Spark. If there is some specific algorithm you have in
mind, it would be good to propose it on JIRA and discuss why you think it
needs to be included in Spark and not live as a Spark package.

In general, there will probably be resistance to including new algorithms
in Spark ML, especially until the ML package has reached full parity with
MLlib. Still, if you can provide more details, that will help us understand
what is best here.

On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone  wrote:

>
> Hello,
>
> I am new to the Spark dev community and also to open source in general,
> but have used Spark extensively.
> I want to create a complete module for anomaly detection in Spark MLlib.
> To that end, I want to know if someone could guide me so I can start the
> development and contribute to Spark MLlib.
>
> Sorry if I sound naive, but any help is appreciated.
>
> Cheers!
> -venna
>
>


New to dev community | Contribution to Mlib

2017-09-14 Thread Venali Sonone
Hello,

I am new to the Spark dev community and also to open source in general, but
have used Spark extensively.
I want to create a complete module for anomaly detection in Spark MLlib.
To that end, I want to know if someone could guide me so I can start the
development and contribute to Spark MLlib.

Sorry if I sound naive, but any help is appreciated.

Cheers!
-venna


New to dev community | Contribution to Mlib

2017-09-13 Thread Venali Sonone
Hello,

I am new to the Spark dev community and also to open source in general, but
have used Spark extensively.
I want to create a complete module for anomaly detection in Spark MLlib.
To that end, I want to know if someone could guide me so I can start the
development and contribute to Spark MLlib.

Sorry if I sound naive, but any help is appreciated.

Cheers!
-venna


Re: Apache Spark Contribution

2017-02-03 Thread Steve Loughran
You might want to look at Nephele: Efficient Parallel Data Processing in the 
Cloud, Warneke & Kao, 2009

http://stratosphere.eu/assets/papers/Nephele_09.pdf

This was some of the work done in the research project which gave birth to
Flink, though this bit didn't surface, as they chose to leave VM allocation to
others.

Essentially: the query planner could track allocations and lifespans of work;
know that if a VM were to be released, to pick the one closest to its hour
being up; and let you choose between fast but expensive vs slow but (maybe)
less expensive, etc.

It's a complex problem: to do it you need to think about more than just spot
load, and more about "how to efficiently divide work amongst a pool of
machines with different lifespans".

What could be good to look at today, rather than hard-coding the logic:

- provide metrics information which higher-level tools could use to make
decisions/send hints down
- maybe schedule things to best support pre-emptible nodes in the cluster; the
ones you bid spot prices for on EC2, get one hour guaranteed, and after that
can be killed without warning.

Preemption-aware scheduling might imply making sure that any critical
information is kept off the preemptible nodes, or at least replicated onto a
long-lived one, and having stuff in the controller ready to react to
unannounced pre-emption. FWIW, when YARN preempts you do get notified, and
maybe even given some very early warning. I don't know if Spark uses that.

There is some support in HDFS for declaring that some nodes have interdependent 
failures, "failure domains", so you could use that to have HDFS handle 
replication and only store 1 copy on preemptible VMs, leaving only the 
scheduling and recovery problem.

Finally, YARN container resizing lets you ask for more resources when busy and
release them when idle. This may be good for CPU load, though memory is not
something programs can easily hand back.

On 2 Feb 2017, at 19:05, Gabi Cristache wrote:

Hello,

My name is Gabriel Cristache and I am a final-year Computer
Engineering/Science student. For my Bachelor's Thesis, I want to add support
for dynamic scaling to a Spark Streaming application.

The goal of the project is to develop an algorithm that automatically scales 
the cluster up and down based on the volume of data processed by the 
application.
You will need to balance between quick reaction to traffic spikes (scale up) 
and avoiding wasted resources (scale down) by implementing something along the 
lines of a PID algorithm.


 Do you think this is feasible? And if so, are there any hints you could
give me that would help with my objective?

Thanks,
Gabriel Cristache



Apache Spark Contribution

2017-02-02 Thread Gabi Cristache
Hello,

My name is Gabriel Cristache and I am a final-year Computer
Engineering/Science student. For my Bachelor's Thesis, I want to add support
for dynamic scaling to a Spark Streaming application.


*The goal of the project is to develop an algorithm that automatically
scales the cluster up and down based on the volume of data processed by the
application.*

*You will need to balance between quick reaction to traffic spikes (scale
up) and avoiding wasted resources (scale down) by implementing something
along the lines of a PID algorithm.*



 Do you think this is feasible? And if so, are there any hints you could
give me that would help with my objective?


Thanks,

Gabriel Cristache
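
For readers unfamiliar with the term: a PID controller combines three terms,
one proportional to the current error, one integrating past errors, and one
tracking the error's rate of change. A minimal Scala sketch of how such a
loop might drive an executor count follows; the gains, the bounds, and the
lag-based error signal are illustrative assumptions, not part of any Spark
API.

// Minimal PID sketch for scaling decisions. The "error" is assumed to be
// observed processing lag minus a target lag, so falling behind produces a
// positive error, which drives a scale-up.
final class PidScaler(kp: Double, ki: Double, kd: Double,
                      minExecutors: Int, maxExecutors: Int) {
  private var integral = 0.0
  private var lastError = 0.0

  def nextExecutorCount(current: Int, error: Double, dtSeconds: Double): Int = {
    integral += error * dtSeconds
    val derivative = (error - lastError) / dtSeconds
    lastError = error
    // Sum the proportional, integral and derivative terms, then clamp.
    val adjustment = kp * error + ki * integral + kd * derivative
    math.max(minExecutors,
      math.min(maxExecutors, current + math.round(adjustment).toInt))
  }
}

Such a loop would presumably sit outside Spark itself, observing batch
processing delay and asking the cluster manager for more or fewer executors
on each tick.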


Contribution to Apache Spark

2016-09-03 Thread aditya1702
Hello,
I am Aditya Vyas and I am currently in my third year of college, doing a BTech
in engineering. I know Python and a little bit of Java. I want to start
contributing to Apache Spark. This is my first time in the field of Big
Data. Can someone please help me with how to get started, and which resources
to look at?






Re: Possible contribution to MLlib

2016-06-21 Thread Jeff Zhang
I think it is valuable to make the distance function pluggable and also to
provide some built-in distance functions. This might also be useful for other
algorithms besides KMeans.

On Tue, Jun 21, 2016 at 7:48 PM, Simon NANTY 
wrote:

> Hi all,
>
>
>
> In my team, we are currently developing a fork of Spark MLlib extending the
> K-means method such that it is possible to set one's own distance function.
> In this implementation, it would be possible to directly pass, as an
> argument to the K-means train function, a distance function whose signature
> is: (VectorWithNorm, VectorWithNorm) => Double.
>
>
>
> We have found the JIRA issue SPARK-11665 proposing to support new
> distances in bisecting K-means. There has also been the JIRA issue
> SPARK-3219 proposing to add Bregman divergences as distance functions, but
> it has not been added to MLlib. Therefore, we are wondering if such an
> extension of the MLlib K-means algorithm would be appreciated by the
> community and would have a chance of being included in future Spark
> releases.
>
>
>
> Regards,
>
>
>
> Simon Nanty
>
>
>



-- 
Best Regards

Jeff Zhang


Possible contribution to MLlib

2016-06-21 Thread Simon NANTY
Hi all,

In my team, we are currently developing a fork of Spark MLlib extending the
K-means method such that it is possible to set one's own distance function. In
this implementation, it would be possible to directly pass, as an argument to
the K-means train function, a distance function whose signature is:
(VectorWithNorm, VectorWithNorm) => Double.

We have found the JIRA issue SPARK-11665 proposing to support new distances
in bisecting K-means. There has also been the JIRA issue SPARK-3219
proposing to add Bregman divergences as distance functions, but it has not
been added to MLlib. Therefore, we are wondering if such an extension of the
MLlib K-means algorithm would be appreciated by the community and would have
a chance of being included in future Spark releases.

Regards,

Simon Nanty
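
To make the proposal concrete, here is a rough Scala sketch of what a
pluggable distance function could look like, written against the public
Vector type rather than MLlib's private VectorWithNorm; the trait and object
names are illustrative assumptions, not the fork's actual API.

import org.apache.spark.ml.linalg.Vector

// Illustrative counterpart of (VectorWithNorm, VectorWithNorm) => Double.
// Serializable so instances can be shipped inside Spark closures.
trait DistanceMeasure extends Serializable {
  def distance(a: Vector, b: Vector): Double
}

object EuclideanDistance extends DistanceMeasure {
  def distance(a: Vector, b: Vector): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.size) { val d = a(i) - b(i); sum += d * d; i += 1 }
    math.sqrt(sum)
  }
}

object CosineDistance extends DistanceMeasure {
  // Cosine distance = 1 - cosine similarity of the two vectors.
  def distance(a: Vector, b: Vector): Double = {
    var dot = 0.0; var na = 0.0; var nb = 0.0; var i = 0
    while (i < a.size) {
      dot += a(i) * b(i); na += a(i) * a(i); nb += b(i) * b(i); i += 1
    }
    1.0 - dot / (math.sqrt(na) * math.sqrt(nb))
  }
}

As it happens, later Spark releases did add a distanceMeasure parameter to
KMeans with built-in "euclidean" and "cosine" options, though not an
arbitrary user-supplied function.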



Re: Unchecked contribution (JIRA and PR)

2015-11-26 Thread Sergio Ramírez

OK, I'll do that. Thanks for the response.

On 17/11/15 at 01:36, Joseph Bradley wrote:

Hi Sergio,

Apart from apologies about limited review bandwidth (from me too!), I 
wanted to add: It would be interesting to hear what feedback you've 
gotten from users of your package. Perhaps you could collect feedback 
by (a) emailing the user list and (b) adding a note in the Spark 
Packages pointing to the JIRA, and encouraging users to add their 
comments directly to the JIRA.  That'd be a nice way to get a sense of 
use cases and priority.


Thanks for your patience,
Joseph

On Wed, Nov 4, 2015 at 7:23 AM, Sergio Ramírez wrote:


OK, for me, time is not a problem. I was just worried that there
was no movement in those issues. I think they are good
contributions. For example, I have found no complex discretization
algorithm in MLlib, which is rare. My algorithm, a Spark
implementation of the well-known discretizer developed by Fayyad
and Irani, could be considered a good starting point for the
discretization part. Furthermore, it is also supported by two
scientific articles.

Anyway, I uploaded these two algorithms as two different packages
to spark-packages.org, but I would
like to contribute directly to MLlib. I understand you have a lot
of requests, and it is not possible to include all the
contributions made by the Spark community.

I'll be patient and ready to collaborate.

Thanks again


On 03/11/15 16:30, Jerry Lam wrote:

Sergio, you are not alone for sure. Check the RowSimilarity
implementation [SPARK-4823]. It has been there for 6 months. It
is very likely that those which don't merge in the version of
Spark they were developed for will never be merged, because Spark
changes quite significantly from version to version if the
algorithm depends a lot on internal APIs.

On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin wrote:

Sergio,

Usually it takes a lot of effort to get something merged into
Spark itself, especially for relatively new algorithms that
might not have established itself yet. I will leave it to
mllib maintainers to comment on the specifics of the
individual algorithms proposed here.

Just another general comment: we have been working on making
packages be as easy to use as possible for Spark users. Right
now it only requires a simple flag to pass to the
spark-submit script to include a package.


On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez wrote:

Hello all:

I developed two packages for MLlib in March. These have
also been uploaded to the spark-packages repository.
Associated with these packages, I created two JIRA
issues and the corresponding pull requests, which are
listed below:

https://github.com/apache/spark/pull/5184
https://github.com/apache/spark/pull/5170

https://issues.apache.org/jira/browse/SPARK-6531
https://issues.apache.org/jira/browse/SPARK-6509

These remain unassigned in JIRA and unverified in GitHub.

Could anyone explain why they are still in this state?
Is this normal?

Thanks!

Sergio R.

-- 


Sergio Ramírez Gallego
Research group on Soft Computing and Intelligent
Information Systems,
Dept. Computer Science and Artificial Intelligence,
University of Granada, Granada, Spain.
Email: srami...@decsai.ugr.es 
Research Group URL: http://sci2s.ugr.es/



Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio,

Apart from apologies about limited review bandwidth (from me too!), I
wanted to add: It would be interesting to hear what feedback you've gotten
from users of your package.  Perhaps you could collect feedback by (a)
emailing the user list and (b) adding a note in the Spark Packages pointing
to the JIRA, and encouraging users to add their comments directly to the
JIRA.  That'd be a nice way to get a sense of use cases and priority.

Thanks for your patience,
Joseph

On Wed, Nov 4, 2015 at 7:23 AM, Sergio Ramírez  wrote:

> OK, for me, time is not a problem. I was just worried that there was no
> movement in those issues. I think they are good contributions. For example,
> I have found no complex discretization algorithm in MLlib, which is rare.
> My algorithm, a Spark implementation of the well-known discretizer
> developed by Fayyad and Irani, could be considered a good starting point
> for the discretization part. Furthermore, it is also supported by two
> scientific articles.
>
> Anyway, I uploaded these two algorithms as two different packages to
> spark-packages.org, but I would like to contribute directly to MLlib. I
> understand you have a lot of requests, and it is not possible to include
> all the contributions made by the Spark community.
>
> I'll be patient and ready to collaborate.
>
> Thanks again
>
>
> On 03/11/15 16:30, Jerry Lam wrote:
>
> Sergio, you are not alone for sure. Check the RowSimilarity implementation
> [SPARK-4823]. It has been there for 6 months. It is very likely that those
> which don't merge in the version of Spark they were developed for will
> never be merged, because Spark changes quite significantly from version to
> version if the algorithm depends a lot on internal APIs.
>
> On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:
>
>> Sergio,
>>
>> Usually it takes a lot of effort to get something merged into Spark
>> itself, especially for relatively new algorithms that might not have
>> established itself yet. I will leave it to mllib maintainers to comment on
>> the specifics of the individual algorithms proposed here.
>>
>> Just another general comment: we have been working on making packages be
>> as easy to use as possible for Spark users. Right now it only requires a
>> simple flag to pass to the spark-submit script to include a package.
>>
>>
>> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez <sramire...@ugr.es> wrote:
>>
>>> Hello all:
>>>
>>> I developed two packages for MLlib in March. These have also been
>>> uploaded to the spark-packages repository. Associated with these
>>> packages, I created two JIRA issues and the corresponding pull requests,
>>> which are listed below:
>>>
>>> https://github.com/apache/spark/pull/5184
>>> https://github.com/apache/spark/pull/5170
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6531
>>> https://issues.apache.org/jira/browse/SPARK-6509
>>>
>>> These remain unassigned in JIRA and unverified in GitHub.
>>>
>>> Could anyone explain why they are still in this state? Is this normal?
>>>
>>> Thanks!
>>>
>>> Sergio R.
>>>
>>> --
>>>
>>> Sergio Ramírez Gallego
>>> Research group on Soft Computing and Intelligent Information Systems,
>>> Dept. Computer Science and Artificial Intelligence,
>>> University of Granada, Granada, Spain.
>>> Email: srami...@decsai.ugr.es
>>> Research Group URL: http://sci2s.ugr.es/
>>>
>>
>
>
> --
>
> Sergio Ramírez Gallego
> Research group on Soft Computing and Intelligent Information Systems,

Re: Unchecked contribution (JIRA and PR)

2015-11-03 Thread Jerry Lam
Sergio, you are not alone for sure. Check the RowSimilarity implementation
[SPARK-4823]. It has been there for 6 months. It is very likely that those
which don't merge in the version of Spark they were developed for will never
be merged, because Spark changes quite significantly from version to version
if the algorithm depends a lot on internal APIs.

On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:

> Sergio,
>
> Usually it takes a lot of effort to get something merged into Spark
> itself, especially for relatively new algorithms that might not have
> established itself yet. I will leave it to mllib maintainers to comment on
> the specifics of the individual algorithms proposed here.
>
> Just another general comment: we have been working on making packages be
> as easy to use as possible for Spark users. Right now it only requires a
> simple flag to pass to the spark-submit script to include a package.
>
>
> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez  wrote:
>
>> Hello all:
>>
>> I developed two packages for MLlib in March. These have also been
>> uploaded to the spark-packages repository. Associated with these packages,
>> I created two JIRA issues and the corresponding pull requests, which are
>> listed below:
>>
>> https://github.com/apache/spark/pull/5184
>> https://github.com/apache/spark/pull/5170
>>
>> https://issues.apache.org/jira/browse/SPARK-6531
>> https://issues.apache.org/jira/browse/SPARK-6509
>>
>> These remain unassigned in JIRA and unverified in GitHub.
>>
>> Could anyone explain why they are still in this state? Is this normal?
>>
>> Thanks!
>>
>> Sergio R.
>>
>> --
>>
>> Sergio Ramírez Gallego
>> Research group on Soft Computing and Intelligent Information Systems,
>> Dept. Computer Science and Artificial Intelligence,
>> University of Granada, Granada, Spain.
>> Email: srami...@decsai.ugr.es
>> Research Group URL: http://sci2s.ugr.es/
>>
>>
>>
>


MLlib Contribution

2015-10-15 Thread Kybe67
Hi, I implemented a clustering algorithm in Scala/Spark during my internship
and would like to contribute it to MLlib, but I don't know how. I am doing my
best to follow these instructions:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

The algorithm is Mean Shift. It works well on multivariate, multidimensional
datasets, especially on images. I think some work remains to be done, but I
don't know what I should do.
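To give reviewers something concrete, here is a minimal sketch of one
mean-shift iteration over an RDD, assuming a flat kernel of bandwidth h;
every name in it is illustrative, not the internship code itself:

import org.apache.spark.rdd.RDD

object MeanShiftSketch {
  // Euclidean distance between two points.
  private def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Component-wise mean of a non-empty set of points.
  private def mean(ps: Array[Array[Double]]): Array[Double] = {
    val sums = ps.reduce((u, v) => u.zip(v).map { case (x, y) => x + y })
    sums.map(_ / ps.length)
  }

  // Shift every point to the mean of its neighbors within bandwidth h.
  // Broadcasting the whole dataset is a small-data simplification; a real
  // implementation would use a spatial index or approximate neighbors.
  def shiftOnce(points: RDD[Array[Double]], h: Double): RDD[Array[Double]] = {
    val all = points.sparkContext.broadcast(points.collect())
    points.map { p =>
      val neighbors = all.value.filter(q => dist(p, q) <= h)
      if (neighbors.isEmpty) p else mean(neighbors)
    }
  }
}

Iterating shiftOnce until points stop moving, then merging points that land
close together, yields the clusters.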

Thank you for your support and for the amazing Spark project.






Re: Contribution

2015-06-14 Thread Joseph Bradley
+1 for checking out the Wiki on Contributing to Spark.  It gives helpful
pointers about finding starter JIRAs, the discussion & code review process,
and how we prioritize algorithms & other contributions.  After you read
that, I would recommend searching JIRA for issues which catch your interest.
Thanks!
Joseph

On Sat, Jun 13, 2015 at 3:55 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 This is a good start, if you haven't seen it already:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Thanks
 Best Regards

 On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 
 sreenivas.raghav...@gmail.com wrote:

 Hi everyone,
  I am interested in contributing new algorithms and optimizing
 existing algorithms in the area of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce the notion of neural networks into Apache Spark?








RE: Contribution

2015-06-13 Thread Eron Wright
The deeplearning4j project provides neural net algorithms for Spark ML. You
may consider it sample code for extending Spark with new ML algorithms.

http://deeplearning4j.org/sparkml
https://github.com/deeplearning4j/deeplearning4j/tree/master/deeplearning4j-scaleout/spark/dl4j-spark-ml
-Eron
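As a rough illustration of what extending Spark with a new ML algorithm looks
like at the API level, here is a minimal, hypothetical spark.ml pipeline-stage
skeleton (Spark 1.x API, where transform takes a DataFrame); the class name
and stub bodies are placeholders, not dl4j-spark-ml code:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

class MyNeuralNetTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("myNeuralNet"))

  // A real implementation would append a prediction column here.
  override def transform(dataset: DataFrame): DataFrame = dataset

  // A real implementation would add the output column to the schema.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): MyNeuralNetTransformer = defaultCopy(extra)
}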
 Date: Fri, 12 Jun 2015 20:16:33 -0700
 From: sreenivas.raghav...@gmail.com
 To: dev@spark.apache.org
 Subject: Contribution
 
 Hi everyone,
  I am interested in contributing new algorithms and optimizing
 existing algorithms in the area of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce the notion of neural networks into Apache Spark?
 
 
 

Re: Contribution

2015-06-13 Thread Akhil Das
This is a good start, if you haven't seen it already:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Thanks
Best Regards

On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 
sreenivas.raghav...@gmail.com wrote:

 Hi everyone,
  I am interested in contributing new algorithms and optimizing
 existing algorithms in the area of graph algorithms and machine learning.
 Please give me some ideas on where to start. Is it possible for me to
 introduce the notion of neural networks into Apache Spark?







Contribution

2015-06-12 Thread srinivasraghavansr71
Hi everyone,
 I am interested in contributing new algorithms and optimizing
existing algorithms in the area of graph algorithms and machine learning.
Please give me some ideas on where to start. Is it possible for me to
introduce the notion of neural networks into Apache Spark?






Re: [jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-14 Thread Imran Rashid
These are great questions -- I dunno the answer to most of them, but I'll
try to at least give my take on "What should be rejected and why?"

For new features, I'm often really confused by our guidelines on what to
include and what to exclude.  Maybe we should ask that all new features
make it clear why they should *not* just be a separate package.

Bug fixes are also a little tricky.  On the one hand, it's hard to say no to
them -- everyone wants all the bugs fixed.  But I think it's actually a lot
harder for someone who isn't experienced with Spark to fix a bug in a
clean way, when they don't know the code base.  Often the proposed fixes
are just kludges tacked on somewhere rather than addressing the real
problem.  It might help to clearly say that the most useful thing they can
do is submit bug reports with simple steps to reproduce, or even better to
submit a failing test case.  Of course submitting a patch is great too, but
we could be clear that patches would only be accepted if they fit into the
long-term design for Spark.
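To illustrate the failing test case suggestion: a reporter might attach a
ScalaTest sketch like the one below, where the suite name, JIRA number, job,
and assertion are all placeholders to be filled in with the actual scenario:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class ReproSuite extends FunSuite {
  test("SPARK-XXXXX: describe the observed misbehavior here") {
    val conf = new SparkConf().setMaster("local[2]").setAppName("repro")
    val sc = new SparkContext(conf)
    try {
      // Minimal job that triggers the bug being reported.
      val result = sc.parallelize(1 to 10).map(_ * 2).sum()
      // Assert the *expected* behavior; this fails while the bug exists.
      assert(result === 110.0)
    } finally {
      sc.stop()
    }
  }
}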

I really feel that saying "no" more directly would be very helpful.
Actually, I think one of the most discouraging things we can do is give a
soft no -- say "oh, that sounds interesting," but then let the PR languish.

thanks for pushing on this Sean, really useful to have this discussion.

On Tue, Apr 14, 2015 at 10:02 AM, Sean Owen so...@cloudera.com wrote:

 Bringing a discussion to dev@. I think the general questions on the table
 are:

 - Should more changes be rejected? What are the pros/cons of that?
 - If no, how do you think about the very large backlog of PRs and JIRAs?
 - What should be rejected and why?
 - How much support is there for proactively cleaning house now? What
 would you close and why?
 - What steps can be taken to prevent people from wasting time on JIRAs
 / PRs that will be rejected?
 - What if anything does this tell us about the patterns of project
 planning to date and what can we learn?

 This overlaps with other discussion on SPARK-6889 but, per Nicholas,
 I wanted to surface this here.

 -- Forwarded message --
 From: Nicholas Chammas (JIRA) j...@apache.org
 Date: Tue, Apr 14, 2015 at 3:38 PM
 Subject: [jira] [Commented] (SPARK-6889) Streamline contribution
 process with update to Contribution wiki, JIRA rules
 To: iss...@spark.apache.org


 Nicholas Chammas commented on SPARK-6889:
 -

 {quote}
 I also agree that most projects don't say no enough and it's
 actually bad for everyone. Yes, one goal was to also set more
 expectation that lots of changes are rejected. If there is widespread
 agreement, I'd also like firmer language in the guide. As you say it
 is also a matter of taste and culture, but, I'd personally favor a lot
 more no.
 {quote}

 Regarding this point about culture, should we have some kind of
 discussion on the dev list to nudge people in the right direction?





Re: Contribution in java

2014-12-20 Thread Koert Kuipers
Yes, it does. Although the core of Spark is written in Scala, it also
maintains Java and Python APIs, and there is plenty of work on those to
contribute to.

On Sat, Dec 20, 2014 at 7:30 AM, sreenivas putta putta.sreeni...@gmail.com
wrote:

 Hi,

 I want to contribute to Spark in Java. Does it support Java? Please let me
 know.

 Thanks,
 Sreenivas



Re: Contribution in java

2014-12-20 Thread vaquar khan
Hi Sreenivas,

Please read the Spark docs first; everything is mentioned there. Without
reading the docs, how can you contribute?

regards,
vaquar khan

On Sat, Dec 20, 2014 at 6:00 PM, sreenivas putta putta.sreeni...@gmail.com
wrote:

 Hi,

 I want to contribute to Spark in Java. Does it support Java? Please let me
 know.

 Thanks,
 Sreenivas




-- 
Regards,
Vaquar Khan
+91 830-851-1500


Re: Spark Contribution

2014-08-23 Thread Nicholas Chammas
That sounds like a good idea.

Continuing along those lines, what do people think of moving the
contributing page entirely from the wiki to GitHub? It feels like the right
place for it since GitHub is where we take contributions, and it also lets
people make improvements to it.

Nick


On Saturday, August 23, 2014, Sean Owen so...@cloudera.com wrote:

 Can I ask a related question, since I have a PR open to touch up
 README.md as we speak (SPARK-3069)?

 If this text is in a file called CONTRIBUTING.md, then it will cause a
 link to appear on the pull request screen, inviting people to review
 the contribution guidelines:

 https://github.com/blog/1184-contributing-guidelines

 This is mildly important as the project wants to make it clear that
 you agree that your contribution is licensed under the AL2, since
 there is no formal ICLA.

 How about I propose moving the text to CONTRIBUTING.md with a pointer
 in README.md? Or keep it in both places?
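A minimal CONTRIBUTING.md along the lines proposed here might read as the
sketch below -- hypothetical wording, not the text that was eventually
committed:

## Contributing to Spark

Contributions via GitHub pull requests are gladly accepted. Before opening
a pull request, please read the contribution guidelines at
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

By submitting a pull request, you agree that your contribution is licensed
under the Apache License 2.0.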

 On Sat, Aug 23, 2014 at 1:08 AM, Reynold Xin r...@databricks.com wrote:
  Great idea. Added the link
  https://github.com/apache/spark/blob/master/README.md
 
 
 
  On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  We should add this link to the readme on GitHub btw.
 
  On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote:
 
   The Apache Spark wiki on how to contribute should be a great place to
   start:
  
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  
   - Henry
  
   On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
Hi,
   
Can someone help me with some links on how to contribute to Spark?
   
Regards
mns
  
  
  
 



Re: Spark Contribution

2014-08-22 Thread Reynold Xin
Great idea. Added the link
https://github.com/apache/spark/blob/master/README.md



On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 We should add this link to the readme on GitHub btw.

 On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote:

  The Apache Spark wiki on how to contribute should be a great place to
  start:
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
  - Henry
 
  On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
   Hi,
  
   Can someone help me with some links on how to contribute to Spark?
  
   Regards
   mns
 
 
 



Re: Spark Contribution

2014-08-22 Thread Maisnam Ns
Thanks, all, for adding this link.


On Sat, Aug 23, 2014 at 5:38 AM, Reynold Xin r...@databricks.com wrote:

 Great idea. Added the link
 https://github.com/apache/spark/blob/master/README.md



 On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 We should add this link to the readme on GitHub btw.

 On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote:

  The Apache Spark wiki on how to contribute should be a great place to
  start:
  https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
  - Henry
 
  On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
   Hi,
  
   Can someone help me with some links on how to contribute to Spark?
  
   Regards
   mns
 
 
 





Spark Contribution

2014-08-21 Thread Maisnam Ns
Hi,

Can someone help me with some links on how to contribute to Spark?

Regards
mns


Re: Spark Contribution

2014-08-21 Thread Henry Saputra
The Apache Spark wiki on how to contribute should be a great place to
start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

- Henry

On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
 Hi,

 Can someone help me with some links on how to contribute to Spark?

 Regards
 mns




Re: Spark Contribution

2014-08-21 Thread Nicholas Chammas
We should add this link to the readme on GitHub btw.

On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote:

 The Apache Spark wiki on how to contribute should be a great place to
 start:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 - Henry

 On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
  Hi,
 
  Can someone help me with some links on how to contribute to Spark?
 
  Regards
  mns





Contribution to MLlib

2014-07-09 Thread MEETHU MATHEW
Hi,

I am interested in contributing a clustering algorithm to Spark's MLlib. I
am focusing on the Gaussian Mixture Model.
But I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952
regarding the same. I would like to know whether the Gaussian Mixture Model is
already implemented or not.
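For reference, the heart of such a contribution would be the EM updates; a
minimal, hypothetical E-step for a univariate mixture over an RDD might look
like the sketch below (the names and the univariate simplification are
illustrative only):

import org.apache.spark.rdd.RDD

// One mixture component: weight, mean, and standard deviation.
case class Component(weight: Double, mu: Double, sigma: Double)

object GmmSketch {
  // Density of a univariate Gaussian at x.
  private def pdf(x: Double, mu: Double, sigma: Double): Double =
    math.exp(-0.5 * math.pow((x - mu) / sigma, 2)) /
      (sigma * math.sqrt(2 * math.Pi))

  // E-step: for each point, the responsibility of each mixture component.
  def eStep(data: RDD[Double], comps: Seq[Component]): RDD[Array[Double]] =
    data.map { x =>
      val weighted = comps.map(c => c.weight * pdf(x, c.mu, c.sigma)).toArray
      val total = weighted.sum
      weighted.map(_ / total)
    }
}

The M-step would then re-estimate each component's weight, mean, and variance
from these responsibilities; a full implementation would also handle
multivariate Gaussians.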



Thanks & Regards,
Meethu M

Re: Contribution to MLlib

2014-07-09 Thread RJ Nowling
Hi Meethu,

There is no code for a Gaussian Mixture Model clustering algorithm in the
repository, but I don't know if anyone is working on it.

RJ

On Wednesday, July 9, 2014, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

 Hi,

 I am interested in contributing a clustering algorithm to Spark's MLlib.
 I am focusing on the Gaussian Mixture Model.
 But I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952
 regarding the same. I would like to know whether the Gaussian Mixture Model
 is already implemented or not.



 Thanks & Regards,
 Meethu M



-- 
em rnowl...@gmail.com
c 954.496.2314


Re: Contribution to MLlib

2014-07-09 Thread Xiangrui Meng
I don't know if anyone is working on it either. If that JIRA is not
moved to Apache JIRA, feel free to create a new one and make a note
that you are working on it. Thanks! -Xiangrui

On Wed, Jul 9, 2014 at 4:56 AM, RJ Nowling rnowl...@gmail.com wrote:
 Hi Meethu,

 There is no code for a Gaussian Mixture Model clustering algorithm in the
 repository, but I don't know if anyone is working on it.

 RJ

 On Wednesday, July 9, 2014, MEETHU MATHEW meethu2...@yahoo.co.in wrote:

 Hi,

 I am interested in contributing a clustering algorithm to Spark's MLlib.
 I am focusing on the Gaussian Mixture Model.
 But I saw a JIRA at https://spark-project.atlassian.net/browse/SPARK-952
 regarding the same. I would like to know whether the Gaussian Mixture Model
 is already implemented or not.



 Thanks & Regards,
 Meethu M



 --
 em rnowl...@gmail.com
 c 954.496.2314