[jira] [Comment Edited] (SPARK-20180) Unlimited max pattern length in Prefix span
    [ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952618#comment-15952618 ]

Cyril de Vogelaere edited comment on SPARK-20180 at 4/2/17 4:50 PM:

Yes, they can; it's really not a critical issue at all.
The current maximum pattern length also works well for most users in practice, except for very large datasets where the sequences are very long. But then I suppose people would know about the parameter and set it to a large value.

However, allowing unlimited pattern length would cost nothing in terms of performance; it's just an additional condition in an if. It may also be easier than always setting the highest value possible. At least, that option wouldn't hurt, and there was a TODO for it in the code.
I think it would be good, even if we don't change the default value of 10. Changing it makes more sense to me, but I understand that, for backward compatibility, we can't just change things as we want. So I will follow my senior's opinion on this.

Actually, I have quite a few improvements in store for PrefixSpan, since I worked on an algorithm for my master's thesis. Notably a very performant implementation that specializes PrefixSpan for single-item patterns, while slightly improving the performance of multi-item patterns. But I was told I needed to get familiar with contributing to Spark first ^^', which is why I'm proposing this small, non-critical improvement and implementing it.

I'm ready to push this small change anytime; it's already implemented. But the contributor wiki asks to run dev/run-tests before pushing, and it's been running for a day and a half already ... Is that normal, by the way?
Also, the tests already found some errors, but I'm 99.999% sure they're not mine. They're not even from the mllib module, which is the only thing I modified ... Is that normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'


was (Author: syrux):
Yes, they can; it's really not a critical issue at all.
The current maximum pattern length also works well for most users in practice, except for very large datasets where the sequences are very long. But then I suppose people would know about the parameter and set it to a large value.

However, changing the default to a value that allows unlimited pattern length would cost nothing in terms of performance; it's just an additional condition in an if. It may also be easier than always setting the highest value possible. At least, that option wouldn't hurt, and there was a TODO for it in the code.

Actually, I have quite a few improvements in store for PrefixSpan, since I worked on an algorithm for my master's thesis. Notably a very performant implementation that specializes PrefixSpan for single-item patterns, while slightly improving the performance of multi-item patterns. But I was told I needed to get familiar with contributing to Spark first ^^', which is why I'm proposing this small, non-critical improvement and implementing it.

I'm ready to push this small change anytime; it's already implemented. But the contributor wiki asks to run dev/run-tests before pushing, and it's been running for a day and a half already ... Is that normal, by the way?
Also, the tests already found some errors, but I'm 99.999% sure they're not mine. They're not even from the mllib module, which is the only thing I modified ... Is that normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'

> Unlimited max pattern length in Prefix span
> ---
>
>                 Key: SPARK-20180
>                 URL: https://issues.apache.org/jira/browse/SPARK-20180
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.0
>            Reporter: Cyril de Vogelaere
>            Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Right now, we need to use the .setMaxPatternLength() method to
> specify the maximum pattern length of a sequence. Any pattern longer than
> that won't be output.
> The current default maxPatternLength value is 10.
> This should be changed so that with input 0, all patterns of any length would
> be output. Additionally, the default value should be changed to 0, so that
> a new user could find all patterns in their dataset without looking at this
> parameter.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-20180) Unlimited max pattern length in Prefix span
    [ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952618#comment-15952618 ]

Cyril de Vogelaere edited comment on SPARK-20180 at 4/2/17 10:39 AM:

Yes, they can; it's really not a critical issue at all.
The current maximum pattern length also works well for most users in practice, except for very large datasets where the sequences are very long. But then I suppose people would know about the parameter and set it to a large value.

However, changing the default to a value that allows unlimited pattern length would cost nothing in terms of performance; it's just an additional condition in an if. It may also be easier than always setting the highest value possible. At least, that option wouldn't hurt, and there was a TODO for it in the code.

Actually, I have quite a few improvements in store for PrefixSpan, since I worked on an algorithm for my master's thesis. Notably a very performant implementation that specializes PrefixSpan for single-item patterns, while slightly improving the performance of multi-item patterns. But I was told I needed to get familiar with contributing to Spark first ^^', which is why I'm proposing this small, non-critical improvement and implementing it.

I'm ready to push this small change anytime; it's already implemented. But the contributor wiki asks to run dev/run-tests before pushing, and it's been running for a day and a half already ... Is that normal, by the way?
Also, the tests already found some errors, but I'm 99.999% sure they're not mine. They're not even from the mllib module, which is the only thing I modified ... Is that normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'


was (Author: syrux):
Yes, they can; it's really not a critical issue at all.
The current maximum pattern length also works well for most users in practice, except for very large datasets where the sequences are very long. But then I suppose people would know about the parameter and set it to a large value.

However, changing the default to a value that allows unlimited pattern length would cost nothing in terms of performance; it's just an additional condition in an if. It may also be easier than always setting the highest value possible. At least, that option wouldn't hurt.

Actually, I have quite a few improvements in store for PrefixSpan, since I worked on an algorithm for my master's thesis. Notably a very performant implementation that specializes PrefixSpan for single-item patterns, while slightly improving the performance of multi-item patterns. But I was told I needed to get familiar with contributing to Spark first ^^', which is why I'm proposing this small, non-critical improvement and implementing it.

I'm ready to push this small change anytime; it's already implemented. But the contributor wiki asks to run dev/run-tests before pushing, and it's been running for a day and a half already ... Is that normal, by the way?
Also, the tests already found some errors, but I'm 99.999% sure they're not mine. They're not even from the mllib module, which is the only thing I modified ... Is that normal too? I suppose so, but I wouldn't want to waste the reviewers' time ^^'

[jira] [Comment Edited] (SPARK-20180) Unlimited max pattern length in Prefix span
    [ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952377#comment-15952377 ]

yuhao yang edited comment on SPARK-20180 at 4/1/17 8:14 PM:

I assume users can achieve the same effect by setting maxPatternLength to a larger value. So the JIRA is really about changing the default behavior of PrefixSpan. Is there more background or context available, such as why the current default length (10) is not good in practice? Thanks.

We also need to consider the performance for larger datasets (in count and dimension).


was (Author: yuhaoyan):
I assume users can achieve the same effect by setting maxPatternLength to a larger value. So the JIRA is really about changing the default behavior of PrefixSpan. Is there more background or context available, such as why the current default length (10) is not good in practice? Thanks.
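
For readers following the thread, the "additional condition in an if" that the proposal refers to can be illustrated with a small standalone sketch. This is not Spark's PrefixSpan code: the names is_within_limit and frequent_patterns are hypothetical, the miner is a toy that only handles sequences of single items, and treating 0 as "no limit" is the behavior proposed in the issue, not an existing setting.

```python
from collections import Counter


def is_within_limit(pattern_length, max_pattern_length):
    """Return True if a pattern of this length may still be extended.

    Assumption taken from the proposal: max_pattern_length == 0 means
    "no limit". This is the single extra condition discussed in the thread.
    """
    return max_pattern_length == 0 or pattern_length < max_pattern_length


def frequent_patterns(sequences, min_count, max_pattern_length=10):
    """Toy prefix-growth miner over sequences of single items.

    Returns a dict mapping each frequent pattern (a tuple of items) to the
    number of input sequences that contain it as a subsequence.
    """
    if max_pattern_length < 0:
        raise ValueError("max_pattern_length must be >= 0")
    results = {}

    def grow(prefix, projected):
        # Count each item at most once per projected suffix.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Only extend the pattern while the length cap allows it.
            if is_within_limit(len(pattern), max_pattern_length):
                # Project: keep what follows the first occurrence of item.
                next_projected = [
                    s[s.index(item) + 1:] for s in projected if item in s
                ]
                grow(pattern, next_projected)

    grow((), [tuple(s) for s in sequences])
    return results


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"]]
    # With the cap set to 2, the length-3 pattern (a, b, c) is dropped.
    print(len(frequent_patterns(data, min_count=2, max_pattern_length=2)))  # 6
    # With 0 (unlimited, as proposed), every frequent pattern is returned.
    print(len(frequent_patterns(data, min_count=2, max_pattern_length=0)))  # 7
```

The check itself is constant-time, which matches the point that the option costs nothing per pattern. The number of patterns returned can still grow with the cap removed, which is the dataset-size concern raised in the comment above.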