[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93573955
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java
 ---
@@ -59,4 +60,49 @@ public boolean isInitialized() {
   return true;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(name = "BIN"
+  , description = "Computes the bin that the value is in given a 
set of bounds."
+  , params = {
+   "value - The value to bin"
+  , "bounds - A list of value bounds (excluding min and max) in 
sorted order."
+}
+  ,returns = "Which bin N the value falls in such that bound(N-1) 
< value <= bound(N). " +
+  "No min and max bounds are provided, so values smaller than the 
0'th bound go in the 0'th bin, " +
+  "and values greater than the last bound go in the M'th bin."
+  )
+  public static class Bin extends BaseStellarFunction {
+
+public static int getBin(double value, List bins, 
Function boundFunc) {
+  for(int bin = 0; bin < bins.size();++bin) {
+double bound = boundFunc.apply(bin);
+if(value <= bound) {
+  return bin;
+}
+  }
+  return bins.size();
+}
+
+@Override
+public Object apply(List args) {
+  Double value = convert(args.get(0), Double.class);
+  List bins = new ArrayList<>();
+  if (args.size() > 1) {
+List objList = convert(args.get(1), List.class);
+if(objList == null) {
+  return null;
+}
+for(Number n : objList) {
+  bins.add(n.doubleValue());
--- End diff --

During this step we need to validate that the bounds list is strictly 
increasing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93574334
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java
 ---
@@ -425,4 +428,61 @@ public Object apply(List args) {
   return result;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(namespace = "STATS", name = "BIN"
+  , description = "Computes the bin that the value is in based on 
the statistical distribution."
+  , params = {
+  "stats - The Stellar statistics object"
+  , "value - The value to bin"
+  , "bounds? - A list of percentile bin bounds (excluding min and 
max) or a string representing a known and common set of bins.  " +
+  "For convenience, we have provided QUARTILE, QUINTILE, and 
DECILE which you can pass in as a string arg." +
+  " If this argument is omitted, then we assume a Quartile bin 
split."
+}
+  ,returns = "Which bin N the value falls in such that bound(N-1) 
< value <= bound(N). " +
+  "No min and max bounds are provided, so values smaller than the 
0'th bound go in the 0'th bin, " +
+  "and values greater than the last bound go in the M'th bin."
+  )
+  public static class StatsBin extends BaseStellarFunction {
+public enum BinSplits {
+  QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
+  QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
+  DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 
80.0, 90.0))
+  ;
+  public final List split;
+  BinSplits(List split) {
+this.split = split;
+  }
+
+  public static List getSplit(Object o) {
+if(o instanceof String) {
+  return BinSplits.valueOf((String)o).split;
+}
+else if(o instanceof List) {
+  List ret = new ArrayList<>();
+  for(Object valO : (List)o) {
+ret.add(ConversionUtils.convert(valO, Double.class));
--- End diff --

During this step we need to validate that the bounds list is strictly 
increasing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93573772
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/MathFunctions.java
 ---
@@ -59,4 +60,49 @@ public boolean isInitialized() {
   return true;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(name = "BIN"
+  , description = "Computes the bin that the value is in given a 
set of bounds."
+  , params = {
+   "value - The value to bin"
+  , "bounds - A list of value bounds (excluding min and max) in 
sorted order."
+}
+  ,returns = "Which bin N the value falls in such that bound(N-1) 
< value <= bound(N). " +
+  "No min and max bounds are provided, so values smaller than the 
0'th bound go in the 0'th bin, " +
+  "and values greater than the last bound go in the M'th bin."
+  )
+  public static class Bin extends BaseStellarFunction {
+
+public static int getBin(double value, List bins, 
Function boundFunc) {
--- End diff --

Nice use of lambdas to expand this into, essentially, the "mapped" function 
you were talking about.
Results in a very parsimonious implementation of the STATS_BIN.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93575237
  
--- Diff: 
metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java
 ---
@@ -373,15 +373,16 @@ public void statsBinRunner(List splits) 
throws Exception {
 
   public void statsBinRunner(List splits, String splitsName) 
throws Exception {
 int bin = 0;
+StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
 for(Double d : stats.getSortedValues()) {
-  StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
   if(bin < splits.size()) {
 double percentileOfBin = provider.getPercentile(splits.get(bin));
 if (d > percentileOfBin) {
   //we aren't the right bin, so let's find the right one.
   // Keep in mind that this value could be more than one bin away 
from the last good bin.
-  for(;bin < splits.size() && d > 
provider.getPercentile(splits.get(bin));bin++) {
-
+  while ( bin < splits.size()  &&  d > 
provider.getPercentile(splits.get(bin)) ) {
+//increment the bin number until it includes the target value, 
or we run out of bins
+bin++;
--- End diff --

This whole block:
```
  if(bin < splits.size()) {
double percentileOfBin = provider.getPercentile(splits.get(bin));
if (d > percentileOfBin) {
  //we aren't the right bin, so let's find the right one.
  // Keep in mind that this value could be more than one bin away 
from the last good bin.
  while ( bin < splits.size()  &&  d > 
provider.getPercentile(splits.get(bin)) ) {
//increment the bin number until it includes the target value, 
or we run out of bins
bin++;
  }
}
  }
``` 
can be replaced by:
```
  while ( bin < splits.size()  &&  d > 
provider.getPercentile(splits.get(bin)) ) {
//increment the bin number until it includes the target value, or 
we run out of bins
bin++;
  }
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #401: METRON-637: Add a STATS_BIN function to Stellar...

2016-12-21 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/401
  
Thanks for the feedback, @mattf-horton!  I went ahead and incorporated your 
changes.  For posterity and so this gets replicated to the JIRA, I added a 
`BIN` function that just takes a list of bounds, not a list of percentiles to 
compute.  `STATS_BIN` could be encoded with `BIN` if we had a `MAP` function 
(i.e. `STATS_BIN` == `BIN( value, MAP( _GET_PERCENTILE(stats, _ ), [ 
25.0, 50.0, 75.0])` where `MAP` takes a function pointer and applies it to a 
collection).  

We do not have this capability yet in Stellar, but it may be worth 
considering to enable these kinds of use-cases.  Just a thought.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #402: METRON-639: The Network Stellar function...

2016-12-21 Thread cestella
GitHub user cestella opened a pull request:

https://github.com/apache/incubator-metron/pull/402

METRON-639: The Network Stellar functions need to have better unit testing

We have very little unit test coverage around the Networking functions in 
Stellar at the edge level. When diving a bit deeper on real data, I found a 
number of bugs around:
* Domains with TLDs that are not part of the proper set of TLDs
* URIs with schemes that Java doesn't know about.
* N_SUBNET takes multiple CIDRs, but only evaluates the first one.
* Generally calling validate on these methods can be unsafe because they do 
not handle null arguments correctly.

This just shores up the functions and provides more explicit and detailed 
unit testing.  If I missed a test case or two, let me know and I'll add it in.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cestella/incubator-metron NetworkFunctionBug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-metron/pull/402.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #402


commit 16720acf00f5f9a62bbacfbba959baf0db97fc47
Author: cstella 
Date:   2016-12-21T23:00:57Z

Added more rigorous tests to look at edge cases for network stellar 
functions.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93537026
  
--- Diff: 
metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java
 ---
@@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
 assertEquals(stats.getSkewness(), (Double) actual, 0.1);
   }
 
+  @Test
+  public void testStatsBin() throws Exception {
+statsInit(windowSize);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, 
"'QUARTILE'");
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, 
"'QUINTILE'");
+statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, 
"'DECILE'");
+statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 
75.0]");
+  }
+
+  public void statsBinRunner(List splits) throws Exception {
+statsBinRunner(splits, null);
+  }
+
+  public void statsBinRunner(List splits, String splitsName) 
throws Exception {
+int bin = 0;
+for(Double d : stats.getSortedValues()) {
+  StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
+  if(bin < splits.size()) {
+double percentileOfBin = provider.getPercentile(splits.get(bin));
+if (d > percentileOfBin) {
--- End diff --

Yes, `provider.getPercentile(x)` returns the value at percentile `x`.  So, 
to get the median, you'd call `provider.getPercentile(50.0)`. 

If you're thinking of the inverse (give me the percentile that a value 
falls at), we don't have that dude yet, but it's called the cumulative 
distribution function or (`ecdf` in R parlance).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93536290
  
--- Diff: 
metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java
 ---
@@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
 assertEquals(stats.getSkewness(), (Double) actual, 0.1);
   }
 
+  @Test
+  public void testStatsBin() throws Exception {
+statsInit(windowSize);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, 
"'QUARTILE'");
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, 
"'QUINTILE'");
+statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, 
"'DECILE'");
+statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 
75.0]");
+  }
+
+  public void statsBinRunner(List splits) throws Exception {
+statsBinRunner(splits, null);
+  }
+
+  public void statsBinRunner(List splits, String splitsName) 
throws Exception {
+int bin = 0;
+for(Double d : stats.getSortedValues()) {
+  StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
+  if(bin < splits.size()) {
+double percentileOfBin = provider.getPercentile(splits.get(bin));
+if (d > percentileOfBin) {
--- End diff --

I see, so provider.getPercentile() actually takes a percentile value (as a 
number between 0 and 99.99) and returns a corresponding raw value?  That makes 
sense if so.  Please confirm.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93535574
  
--- Diff: 
metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java
 ---
@@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
 assertEquals(stats.getSkewness(), (Double) actual, 0.1);
   }
 
+  @Test
+  public void testStatsBin() throws Exception {
+statsInit(windowSize);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, 
"'QUARTILE'");
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, 
"'QUINTILE'");
+statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, 
"'DECILE'");
+statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 
75.0]");
+  }
+
+  public void statsBinRunner(List splits) throws Exception {
+statsBinRunner(splits, null);
+  }
+
+  public void statsBinRunner(List splits, String splitsName) 
throws Exception {
+int bin = 0;
+for(Double d : stats.getSortedValues()) {
+  StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
+  if(bin < splits.size()) {
+double percentileOfBin = provider.getPercentile(splits.get(bin));
+if (d > percentileOfBin) {
+  //we aren't the right bin, so let's find the right one.
+  // Keep in mind that this value could be more than one bin away 
from the last good bin.
+  for(;bin < splits.size() && d > 
provider.getPercentile(splits.get(bin));bin++) {
+
--- End diff --

Hopefully the explanation above makes things clearer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93528090
  
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is 
used.
   * Input:
 * stats - The Stellar statistics object
   * Returns: The variance of the values in the window or NaN if the 
statistics object is null.
+* `STATS_BIN`
+  * Description: Computes the bin that the value is in based on the 
statistical distribution. 
+  * Input:
+* stats - The Stellar statistics object
+* value - The value to bin
+* range? - A list of percentile bin ranges (excluding min and max) or 
a string representing a known and common set of bins.  For convenience, we have 
provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. 
If this argument is omitted, then we assume a Quartile bin split.
--- End diff --

Sure!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93530871
  
--- Diff: 
metron-analytics/metron-statistics/src/test/java/org/apache/metron/statistics/StellarStatisticsFunctionsTest.java
 ---
@@ -356,6 +357,44 @@ public void testSkewness() throws Exception {
 assertEquals(stats.getSkewness(), (Double) actual, 0.1);
   }
 
+  @Test
+  public void testStatsBin() throws Exception {
+statsInit(windowSize);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split);
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUARTILE.split, 
"'QUARTILE'");
+
statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.QUINTILE.split, 
"'QUINTILE'");
+statsBinRunner(StellarStatisticsFunctions.Bin.BinSplits.DECILE.split, 
"'DECILE'");
+statsBinRunner(ImmutableList.of(25.0, 50.0, 75.0), "[25.0, 50.0, 
75.0]");
+  }
+
+  public void statsBinRunner(List splits) throws Exception {
+statsBinRunner(splits, null);
+  }
+
+  public void statsBinRunner(List splits, String splitsName) 
throws Exception {
+int bin = 0;
+for(Double d : stats.getSortedValues()) {
+  StatisticsProvider provider = 
(StatisticsProvider)variables.get("stats");
+  if(bin < splits.size()) {
+double percentileOfBin = provider.getPercentile(splits.get(bin));
+if (d > percentileOfBin) {
--- End diff --

Sorry if I'm wrong here, but I couldn't find the definition of 
stats.getSortedValues().
Isn't this line 380 comparing a raw value `d` to a percentile value 
`percentileOfBin` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #397: METRON-627: Add HyperLogLogPlus implemen...

2016-12-21 Thread mmiklavc
Github user mmiklavc commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/397#discussion_r93529216
  
--- Diff: 
metron-platform/metron-common/src/main/java/org/apache/metron/common/utils/HyperLogLogPlus.java
 ---
@@ -0,0 +1,102 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.common.utils;
+
+import 
com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
+import com.clearspring.analytics.stream.cardinality.ICardinality;
+import com.google.common.collect.Lists;
+
+import java.io.Serializable;
+import java.util.List;
+
--- End diff --

Heads up, I searched arxiv and  it doesn't appear that this paper is there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93524927
  
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is 
used.
   * Input:
 * stats - The Stellar statistics object
   * Returns: The variance of the values in the window or NaN if the 
statistics object is null.
+* `STATS_BIN`
+  * Description: Computes the bin that the value is in based on the 
statistical distribution. 
+  * Input:
+* stats - The Stellar statistics object
+* value - The value to bin
+* range? - A list of percentile bin ranges (excluding min and max) or 
a string representing a known and common set of bins.  For convenience, we have 
provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. 
If this argument is omitted, then we assume a Quartile bin split.
--- End diff --

It seems like it would be barely more effort than writing another 
StellarFunction wrapper, but maybe I'm being optimistic again.  I would support 
including both here, but if you prefer to separate them I'm fine with that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93520475
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java
 ---
@@ -425,4 +428,74 @@ public Object apply(List args) {
   return result;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(namespace = "STATS", name = "BIN"
+  , description = "Computes the bin that the value is in based on 
the statistical distribution."
+  , params = {
+  "stats - The Stellar statistics object"
+  , "value - The value to bin"
+  , "range? - A list of percentile bin ranges (excluding min and 
max) or a string representing a known and common set of bins.  " +
+  "For convenience, we have provided QUARTILE, QUINTILE, and 
DECILE which you can pass in as a string arg." +
+  " If this argument is omitted, then we assume a Quartile bin 
split."
--- End diff --

This is a list of bounds rather than ranges.  More precisely, the N'th 
bound specifies the closed upper bound of the N'th bin, where N is zero 
indexed; and if there is a list of M bounds then there are M+1 bins, numbered 0 
through M. (Bin M is the bin for values greater than last bound in the list.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93524020
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java
 ---
@@ -425,4 +428,74 @@ public Object apply(List args) {
   return result;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(namespace = "STATS", name = "BIN"
+  , description = "Computes the bin that the value is in based on 
the statistical distribution."
+  , params = {
+  "stats - The Stellar statistics object"
+  , "value - The value to bin"
+  , "range? - A list of percentile bin ranges (excluding min and 
max) or a string representing a known and common set of bins.  " +
+  "For convenience, we have provided QUARTILE, QUINTILE, and 
DECILE which you can pass in as a string arg." +
+  " If this argument is omitted, then we assume a Quartile bin 
split."
+  }
+  , returns = "Which bin the value falls in such that bin < value 
< bin + 1"
+  )
+  public static class Bin extends BaseStellarFunction {
+public enum BinSplits {
+  QUARTILE(ImmutableList.of(25.0, 50.0, 75.0)),
+  QUINTILE(ImmutableList.of(20.0, 40.0, 60.0, 80.0)),
+  DECILE(ImmutableList.of(10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 
80.0, 90.0))
+  ;
+  public final List split;
+  BinSplits(List split) {
+this.split = split;
+  }
+
+  public static List getSplit(Object o) {
+if(o instanceof String) {
+  return BinSplits.valueOf((String)o).split;
+}
+else if(o instanceof List) {
+  List ret = new ArrayList<>();
+  for(Object valO : (List)o) {
+ret.add(ConversionUtils.convert(valO, Double.class));
+  }
+  return ret;
+}
+throw new IllegalStateException("The split you tried to pass is 
not a valid split: " + o.toString());
+  }
+}
+
+
+@Override
+public Object apply(List args) {
+  StatisticsProvider stats = convert(args.get(0), 
StatisticsProvider.class);
+  Double value = convert(args.get(1), Double.class);
+  List bins = BinSplits.QUARTILE.split;
+  if (args.size() > 2) {
+bins = BinSplits.getSplit(args.get(2));
+  }
+  if (stats == null || value == null || bins.size() == 0) {
+return -1;
+  }
+
+  double prevPctile = stats.getPercentile(bins.get(0));
+
+  if(value <= prevPctile) {
+return 0;
+  }
+  for(int bin = 1; bin < bins.size();++bin) {
+double pctile = stats.getPercentile(bins.get(bin));
+if(value > prevPctile && value <= pctile) {
--- End diff --

haha yes, you can.  I'll correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93523895
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java
 ---
@@ -425,4 +428,74 @@ public Object apply(List args) {
   return result;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(namespace = "STATS", name = "BIN"
+  , description = "Computes the bin that the value is in based on 
the statistical distribution."
+  , params = {
+  "stats - The Stellar statistics object"
+  , "value - The value to bin"
+  , "range? - A list of percentile bin ranges (excluding min and 
max) or a string representing a known and common set of bins.  " +
+  "For convenience, we have provided QUARTILE, QUINTILE, and 
DECILE which you can pass in as a string arg." +
+  " If this argument is omitted, then we assume a Quartile bin 
split."
+  }
+  , returns = "Which bin the value falls in such that bin < value 
< bin + 1"
--- End diff --

yup, will do; great suggestion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
Github user cestella commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93523813
  
--- Diff: metron-analytics/metron-statistics/README.md ---
@@ -112,6 +112,13 @@ functions can be used from everywhere where Stellar is 
used.
   * Input:
 * stats - The Stellar statistics object
   * Returns: The variance of the values in the window or NaN if the 
statistics object is null.
+* `STATS_BIN`
+  * Description: Computes the bin that the value is in based on the 
statistical distribution. 
+  * Input:
+* stats - The Stellar statistics object
+* value - The value to bin
+* range? - A list of percentile bin ranges (excluding min and max) or 
a string representing a known and common set of bins.  For convenience, we have 
provided QUARTILE, QUINTILE, and DECILE which you can pass in as a string arg. 
If this argument is omitted, then we assume a Quartile bin split.
--- End diff --

It would, for sure, but probably not called `STATS_BIN`.  I was thinking of 
adding a proper `BIN` function as a follow-on and refactoring this one to use 
it.  Or do you think we should just bite the bullet and create it all here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/401#discussion_r93520478
  
--- Diff: 
metron-analytics/metron-statistics/src/main/java/org/apache/metron/statistics/StellarStatisticsFunctions.java
 ---
@@ -425,4 +428,74 @@ public Object apply(List args) {
   return result;
 }
   }
+
+  /**
+   * Calculates the statistical bin that a value falls in.
+   */
+  @Stellar(namespace = "STATS", name = "BIN"
+  , description = "Computes the bin that the value is in based on 
the statistical distribution."
+  , params = {
+  "stats - The Stellar statistics object"
+  , "value - The value to bin"
+  , "range? - A list of percentile bin ranges (excluding min and 
max) or a string representing a known and common set of bins.  " +
+  "For convenience, we have provided QUARTILE, QUINTILE, and 
DECILE which you can pass in as a string arg." +
+  " If this argument is omitted, then we assume a Quartile bin 
split."
+  }
+  , returns = "Which bin the value falls in such that bin < value 
< bin + 1"
--- End diff --

One of the "<" has to be "<=", please.  From code, it's the upper one.

Suggest phrasing as:
returns = "Which bin N the value falls in such that bound(N-1) < value <= 
bound(N). " +
"No min and max bounds are provided, so values smaller than the 0'th bound 
go in the 0'th bin, " +
"and values greater than the last bound go in the M'th bin."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: Apache Lic on README.md files?

2016-12-21 Thread Otto Fowler
Looks like they use XML escapes 



On December 21, 2016 at 15:48:07, Casey Stella (ceste...@gmail.com) wrote:

We should do this. I was not aware that Markdown took comments. Every
file which is capable of handling comments should have an apache license,
IIRC.

On Wed, Dec 21, 2016 at 3:45 PM, Otto Fowler 
wrote:

> I was just looking at the NiFi source, and noticed that they include the
> license header in their readme files. Is this something that we are
> supposed to be doing or is it optional?
>


Re: Apache Lic on README.md files?

2016-12-21 Thread Casey Stella
We should do this.  I was not aware that Markdown took comments.  Every
file which is capable of handling comments should have an apache license,
IIRC.

On Wed, Dec 21, 2016 at 3:45 PM, Otto Fowler 
wrote:

> I was just looking at the NiFi source, and noticed that they include the
> license header in their readme files.  Is this something that we are
> supposed to be doing or is it optional?
>


[GitHub] incubator-metron issue #393: METRON-622: Create a Metron Docker Compose appl...

2016-12-21 Thread kylerichardson
Github user kylerichardson commented on the issue:

https://github.com/apache/incubator-metron/pull/393
  
Thanks for the explanation @merrimanr. I totally agree on the need for the
local IDE to have access to the containerized services. It's an easy enough
fix for me to manipulate the DOCKER_HOST env variable on Fedora and avoid
using docker-machine (although, I do see that as a big win for ease of use
on Mac and Windows).

I haven't had much luck with vagrant myself so am super excited for a
docker alternative to quick dev. As an aside, there are some Ansible
modules for docker (
https://docs.ansible.com/ansible/list_of_cloud_modules.html#docker) that we
could look into for incorporating some of the setup scripts and image
building to make it even more user friendly. Might just be a low priority
nice to have but thought I'd throw it out there.

On Mon, Dec 19, 2016 at 4:32 PM, merrimanr  wrote:

> @kylerichardson  thank you for
> reviewing it! I updated the documentation to include Docker for Mac or
> Docker for Windows. One of the primary requirements for the way I use this
> is that containers must be accessible from my local environment where my
> IDE is running so services that broadcast their host address are tricky.
> The Kafka advertised listener thing was the single most challenging issue 
I
> faced so not surprised you hit it too. The Kafka Dockerfile is wired to
> pull the DOCKER_HOST from an input argument and the compose file is wired
> to pass the local DOCKER_HOST environment variable as the DOCKER_HOST 
input
> argument to the Kafka Dockerfile. So you should be able to set your local
> DOCKER_HOST environment variable as such:
>
> $ export DOCKER_HOST="tcp://:2376"
>
> When you run "eval $(docker-machine env metron-machine)" that's pretty
> much what Docker Machine is doing, setting local environment variables to
> match the desired host. Then after you build the environment, the
> advertised.listener property should be set to without you having to
> manually change it.
>
> I only used the virtualbox drive because that's what came out of the box.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> 
,
> or mute the thread
> 

> .
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Otto Fowler
+1



On December 21, 2016 at 14:56:06, Michael Miklavcic (
michael.miklav...@gmail.com) wrote:

Works for me also.

On Wed, Dec 21, 2016 at 12:38 PM, Matt Foley  wrote:

> Works for me, thanks.
>
> On 12/21/16, 11:21 AM, "Casey Stella"  wrote:
>
> Sure, how about making it generic to "a deployed cluster"?
>
> On Wed, Dec 21, 2016 at 2:20 PM, Matt Foley  wrote:
>
> > +1 on Casey’s first edit. However, wrt the second, can we please not
> > require vagrant? Any of our single-node test deployments, including
> > vagrant, ansible, mpack, or (soon :-) docker, should be acceptable.
> >
> > Thanks,
> > --Matt (who can’t run vagrant workably on the systems available to
> me)
> >
> >
> > On 12/21/16, 8:52 AM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > Agreed on Casey's addition to 2.5. What do you think about
> saying the
> > pla
> > should be stated on the PR, since that will be replicated to Jira
> > automatically?
> >
> > On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > Oh, one more, I propose the following addition to 2.5:
> > > >
> > > > JIRAs will have a description of how to exercise the
> functionality
> > in a
> > > > step-by-step manner on a Quickdev vagrant instance to aid
> review
> > and
> > > > validation.
> > >
> > >
> > > When Mike, Otto and I moved the system to the current version
> of
> > Storm, we
> > > needed a broader smoke test than just running data through that
> > exercised a
> > > variety of the features. We pulled those smoke tests from the
> various
> > > discussions in the JIRAs.
> > >
> > >
> > >
> > > On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> > >
> > > > We have been having a lively discussion on METRON-590 (see
> > > > https://github.com/apache/incubator-metron/pull/395) around
> > creating
> > > > multiple abstractions to do the same (or very nearly the
> same)
> > thing.
> > > >
> > > > I'd like to propose an addition to section 2.3 which reads:
> > > >
> > > >> Contributions which provide abstractions which are either
> very
> > similar
> > > to
> > > >> or a subset of existing abstractions should use and extend
> > existing
> > > >> abstractions rather than provide competing abstractions
> unless
> > > engineering
> > > >> exigencies (e.g. performance ) make such an operation
> impossible
> > without
> > > >> compromising core functionality of the platform.
> > > >
> > > >
> > > > I'd like to suggest the following anecdote from the early
> years of
> > the
> > > > codebase to justify the above:
> > > >
> > > > Stellar started as a predicate language only for threat
> triage
> > rules. As
> > > > such, when the task of creating Field Transformations came
> to me, I
> > > needed
> > > > something like Stellar except I needed it to return arbitrary
> > objects,
> > > > rather than just booleans. In my infinite wisdom, I chose to
> fork
> > the
> > > > language, create a second, more specific DSL for field
> > transformations,
> > > > thereby creating "Metron Query Language" and "Metron
> Transformation
> > > > Language."
> > > >
> > > > I felt a nagging feeling at the time that I should just
> expand the
> > query
> > > > language, but I convinced myself that it would require too
> much
> > testing
> > > and
> > > > it would be a change that was too broad in scope. It took 3
> months
> > for me
> > > > to get around to unifying those languages and if we had more
> > people using
> > > > it, it would have been an absolute nightmare.
> > > >
> > > > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella <
> ceste...@gmail.com>
> > > wrote:
> > > >
> > > >> Yeah, I +1 the notion of thorough automated tests.
> > > >>
> > > >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley <
> ma...@apache.org>
> > wrote:
> > > >>
> > > >>> Hard to mark diffs in text-only mode :-) My proposed
> change is:
> > > >>>
> > > >>> >> All merged patches will be reviewed with the
> expectation that
> > > >>> thorough automated tests shall be provided and are
> consistent
> > with …
> > > >>>
> > > >>> 
> > > >>> ^^
> > > >>> Added word “thorough” and changed “exist” to “shall be
> provided”.
> > > >>> Thanks,
> > > >>> --Matt
> > > >>>
> > > >>> On 12/20/16, 1:22 PM, "James Sirota" 
> wrote:
> > > >>>
> > > >>> Hi Matt, thats already in there. See last bullet point
> of 2.6
> > > >>>
> > > >>> 20.12.2016, 14:14, "Matt Foley" :
> > > >>> > If we aren't going to require 100% test coverage for
> new
> > code,
> > > >>> then we should at least say "thorough" automated tests, in
> the
> > last
> > > bullet
> > > >>> of 2.6. And it should be a mandate not an assumption:
> > > >>> >
> > > >>> > All merged patches will be reviewed with the
> expectation
> > that
> > > >>> thorough automated tests shall be provided and are
> consistent
> > with
> > > project
> > > >>> testing methodology and practices, and 

[GitHub] incubator-metron pull request #401: METRON-637: Add a STATS_BIN function to ...

2016-12-21 Thread cestella
GitHub user cestella opened a pull request:

https://github.com/apache/incubator-metron/pull/401

METRON-637: Add a STATS_BIN function to Stellar.

When passing parameters to models, it's often useful to pass the binned 
representation of a variable based on an empirical statistical distribution, 
rather than the actual variable. This function should accept a set of 
percentile bins and a statistical sketch and a value. It should return the 
index where the percentile of the value falls.

For instance, consider the value 17 who is percentile 27. If we use 25, 75, 
95 to define our bins, this function would return 1, because its percentile, 
27, is between 25 and 75.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cestella/incubator-metron METRON-637

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-metron/pull/401.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #401


commit 6cc98a22af1efe0fc548c59e4a1fa379a41e245b
Author: cstella 
Date:   2016-12-21T17:47:51Z

Added STATS_BIN function.

commit 81921691ef413b407117764b0597c471f7eebf30
Author: cstella 
Date:   2016-12-21T19:22:43Z

Documentation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Michael Miklavcic
Works for me also.

On Wed, Dec 21, 2016 at 12:38 PM, Matt Foley  wrote:

> Works for me, thanks.
>
> On 12/21/16, 11:21 AM, "Casey Stella"  wrote:
>
> Sure, how about making it generic to "a deployed cluster"?
>
> On Wed, Dec 21, 2016 at 2:20 PM, Matt Foley  wrote:
>
> > +1 on Casey’s first edit.  However, wrt the second, can we please not
> > require vagrant?  Any of our single-node test deployments, including
> > vagrant, ansible, mpack, or (soon :-) docker, should be acceptable.
> >
> > Thanks,
> > --Matt (who can’t run vagrant workably on the systems available to
> me)
> >
> >
> > On 12/21/16, 8:52 AM, "Michael Miklavcic" <
> michael.miklav...@gmail.com>
> > wrote:
> >
> > Agreed on Casey's addition to 2.5. What do you think about
> saying the
> > pla
> > should be stated on the PR, since that will be replicated to Jira
> > automatically?
> >
> > On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> >
> > > Oh, one more, I propose the following addition to 2.5:
> > > >
> > > > JIRAs will have a description of how to exercise the
> functionality
> > in a
> > > > step-by-step manner on a Quickdev vagrant instance to aid
> review
> > and
> > > > validation.
> > >
> > >
> > > When Mike, Otto and I moved the system to the current version
> of
> > Storm, we
> > > needed a broader smoke test than just running data through that
> > exercised a
> > > variety of the features. We pulled those smoke tests from the
> various
> > > discussions in the JIRAs.
> > >
> > >
> > >
> > > On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella <
> ceste...@gmail.com>
> > wrote:
> > >
> > > > We have been having a lively discussion on METRON-590 (see
> > > > https://github.com/apache/incubator-metron/pull/395) around
> > creating
> > > > multiple abstractions to do the same (or very nearly the
> same)
> > thing.
> > > >
> > > > I'd like to propose an addition to section 2.3 which reads:
> > > >
> > > >> Contributions which provide abstractions which are either
> very
> > similar
> > > to
> > > >> or a subset of existing abstractions should use and extend
> > existing
> > > >> abstractions rather than provide competing abstractions
> unless
> > > engineering
> > > >> exigencies (e.g. performance ) make such an operation
> impossible
> > without
> > > >> compromising core functionality of the platform.
> > > >
> > > >
> > > > I'd like to suggest the following anecdote from the early
> years of
> > the
> > > > codebase to justify the above:
> > > >
> > > > Stellar started as a predicate language only for threat
> triage
> > rules. As
> > > > such, when the task of creating Field Transformations came
> to me, I
> > > needed
> > > > something like Stellar except I needed it to return arbitrary
> > objects,
> > > > rather than just booleans. In my infinite wisdom, I chose to
> fork
> > the
> > > > language, create a second, more specific DSL for field
> > transformations,
> > > > thereby creating "Metron Query Language" and "Metron
> Transformation
> > > > Language."
> > > >
> > > > I felt a nagging feeling at the time that I should just
> expand the
> > query
> > > > language, but I convinced myself that it would require too
> much
> > testing
> > > and
> > > > it would be a change that was too broad in scope. It took 3
> months
> > for me
> > > > to get around to unifying those languages and if we had more
> > people using
> > > > it, it would have been an absolute nightmare.
> > > >
> > > > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella <
> ceste...@gmail.com>
> > > wrote:
> > > >
> > > >> Yeah, I +1 the notion of thorough automated tests.
> > > >>
> > > >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley <
> ma...@apache.org>
> > wrote:
> > > >>
> > > >>> Hard to mark diffs in text-only mode :-)  My proposed
> change is:
> > > >>>
> > > >>> >> All merged patches will be reviewed with the
> expectation that
> > > >>> thorough automated tests shall be provided and are
> consistent
> > with …
> > > >>>
> > > >>>
> > > >>>  ^^
> > > >>> Added word “thorough” and changed “exist” to “shall be
> provided”.
> > > >>> Thanks,
> > > >>> --Matt
> > > >>>
> > > >>> On 12/20/16, 1:22 PM, 

[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
Here is the Jira:  https://issues.apache.org/jira/browse/METRON-638


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
No problem @nickwallen, I agree you shouldn't have to tackle all of this on 
your own.  I will volunteer for the ConfiguredBolt refactor task.  Let me 
create a Jira and we can start collaborating on a design.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
Recording here for posterity, in case anyone is interested:  I asked on the 
storm user list:
```
I’ve been unable to find this in the Storm documentation:
How does the Storm topology loader identify if a Bolt is a “windowing” 
bolt and therefore 
needs to be wrapped with WindowedBoltExecutor?  Does it look at:
1.  Whether the Bolt’s getComponentConfiguration() method presents 
some or all of the 
windowing-related configuration parameters, as BaseWindowedBolt does;
2.  Or does it use reflection to determine if the Bolt implements 
IWindowedBolt interface;
3.  Or does the Bolt actually have to extend BaseWindowedBolt?
There are indications in the docs that it is #2, but I wasn’t able to 
become certain.  
Please clarify.
Thanks,
--Matt
```
and got confirmation that it is #2 only:
```
From: Arthur Maciejewicz 
You must satisfy the IWindowedBolt and IComponent interfaces. 
BaseWindowedBolt is there for 
your convenience. When constructing the topology, there is a setBolt method 
on TopologyBuilder 
specifically for bolts satisfying the IWindowedBolt interface. It will be 
wrapped with a 
WindowedBoltExecutor by the TopologyBuilder for you. You can implement 
windows yourself by 
returning a HashMap from getComponentConfiguration in your custom bolt (as 
long as they also 
implement the IWindowedBolt interface).
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Matt Foley
Works for me, thanks.

On 12/21/16, 11:21 AM, "Casey Stella"  wrote:

Sure, how about making it generic to "a deployed cluster"?

On Wed, Dec 21, 2016 at 2:20 PM, Matt Foley  wrote:

> +1 on Casey’s first edit.  However, wrt the second, can we please not
> require vagrant?  Any of our single-node test deployments, including
> vagrant, ansible, mpack, or (soon :-) docker, should be acceptable.
>
> Thanks,
> --Matt (who can’t run vagrant workably on the systems available to me)
>
>
> On 12/21/16, 8:52 AM, "Michael Miklavcic" 
> wrote:
>
> Agreed on Casey's addition to 2.5. What do you think about saying the
> pla
> should be stated on the PR, since that will be replicated to Jira
> automatically?
>
> On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella 
> wrote:
>
> > Oh, one more, I propose the following addition to 2.5:
> > >
> > > JIRAs will have a description of how to exercise the functionality
> in a
> > > step-by-step manner on a Quickdev vagrant instance to aid review
> and
> > > validation.
> >
> >
> > When Mike, Otto and I moved the system to the current version of
> Storm, we
> > needed a broader smoke test than just running data through that
> exercised a
> > variety of the features. We pulled those smoke tests from the 
various
> > discussions in the JIRAs.
> >
> >
> >
> > On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella 
> wrote:
> >
> > > We have been having a lively discussion on METRON-590 (see
> > > https://github.com/apache/incubator-metron/pull/395) around
> creating
> > > multiple abstractions to do the same (or very nearly the same)
> thing.
> > >
> > > I'd like to propose an addition to section 2.3 which reads:
> > >
> > >> Contributions which provide abstractions which are either very
> similar
> > to
> > >> or a subset of existing abstractions should use and extend
> existing
> > >> abstractions rather than provide competing abstractions unless
> > engineering
> > >> exigencies (e.g. performance ) make such an operation impossible
> without
> > >> compromising core functionality of the platform.
> > >
> > >
> > > I'd like to suggest the following anecdote from the early years of
> the
> > > codebase to justify the above:
> > >
> > > Stellar started as a predicate language only for threat triage
> rules. As
> > > such, when the task of creating Field Transformations came to me, 
I
> > needed
> > > something like Stellar except I needed it to return arbitrary
> objects,
> > > rather than just booleans. In my infinite wisdom, I chose to fork
> the
> > > language, create a second, more specific DSL for field
> transformations,
> > > thereby creating "Metron Query Language" and "Metron 
Transformation
> > > Language."
> > >
> > > I felt a nagging feeling at the time that I should just expand the
> query
> > > language, but I convinced myself that it would require too much
> testing
> > and
> > > it would be a change that was too broad in scope. It took 3 months
> for me
> > > to get around to unifying those languages and if we had more
> people using
> > > it, it would have been an absolute nightmare.
> > >
> > > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella 
> > wrote:
> > >
> > >> Yeah, I +1 the notion of thorough automated tests.
> > >>
> > >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley 
> wrote:
> > >>
> > >>> Hard to mark diffs in text-only mode :-)  My proposed change is:
> > >>>
> > >>> >> All merged patches will be reviewed with the expectation that
> > >>> thorough automated tests shall be provided and are consistent
> with …
> > >>>
> > >>>
> > >>>  ^^
> > >>> Added word “thorough” and changed “exist” to “shall be 
provided”.
> > >>> Thanks,
> > >>> --Matt
> > >>>
> > >>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
> > >>>
> > >>> Hi Matt, thats already in there. See last bullet point of 
2.6
> > >>>
> > >>> 20.12.2016, 14:14, "Matt Foley" :
> > >>> > If we aren't going to require 100% test coverage for new
> code,
> > >>> then we should at least say 

[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
@nickwallen , totally understand, and I respect your efforts to keep things 
as simple as possible in the original scope.  But I'm much happier with this 
scoping :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
Ok, gotcha @mattf-horton.  Understand now.  

To be fair, in this PR, I wasn't trying to replace `ConfiguredBolt` so what 
I have in `ZkConfigurationManager` was the easiest way I found to get the 
functionality that I needed for the Profiler.  In submitting a separate PR as 
described in step (2),  whose scope is much greater, your guidelines on 
refactoring sound logical to me.

And I don't want to presume that I should be the one to tackle this even.  
@merrimanr I'm open to any volunteers who you think are better qualified.  I 
would have prefered not to touch this code at all.





---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Casey Stella
Sure, how about making it generic to "a deployed cluster"?

On Wed, Dec 21, 2016 at 2:20 PM, Matt Foley  wrote:

> +1 on Casey’s first edit.  However, wrt the second, can we please not
> require vagrant?  Any of our single-node test deployments, including
> vagrant, ansible, mpack, or (soon :-) docker, should be acceptable.
>
> Thanks,
> --Matt (who can’t run vagrant workably on the systems available to me)
>
>
> On 12/21/16, 8:52 AM, "Michael Miklavcic" 
> wrote:
>
> Agreed on Casey's addition to 2.5. What do you think about saying the
> pla
> should be stated on the PR, since that will be replicated to Jira
> automatically?
>
> On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella 
> wrote:
>
> > Oh, one more, I propose the following addition to 2.5:
> > >
> > > JIRAs will have a description of how to exercise the functionality
> in a
> > > step-by-step manner on a Quickdev vagrant instance to aid review
> and
> > > validation.
> >
> >
> > When Mike, Otto and I moved the system to the current version of
> Storm, we
> > needed a broader smoke test than just running data through that
> exercised a
> > variety of the features. We pulled those smoke tests from the various
> > discussions in the JIRAs.
> >
> >
> >
> > On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella 
> wrote:
> >
> > > We have been having a lively discussion on METRON-590 (see
> > > https://github.com/apache/incubator-metron/pull/395) around
> creating
> > > multiple abstractions to do the same (or very nearly the same)
> thing.
> > >
> > > I'd like to propose an addition to section 2.3 which reads:
> > >
> > >> Contributions which provide abstractions which are either very
> similar
> > to
> > >> or a subset of existing abstractions should use and extend
> existing
> > >> abstractions rather than provide competing abstractions unless
> > engineering
> > >> exigencies (e.g. performance ) make such an operation impossible
> without
> > >> compromising core functionality of the platform.
> > >
> > >
> > > I'd like to suggest the following anecdote from the early years of
> the
> > > codebase to justify the above:
> > >
> > > Stellar started as a predicate language only for threat triage
> rules. As
> > > such, when the task of creating Field Transformations came to me, I
> > needed
> > > something like Stellar except I needed it to return arbitrary
> objects,
> > > rather than just booleans. In my infinite wisdom, I chose to fork
> the
> > > language, create a second, more specific DSL for field
> transformations,
> > > thereby creating "Metron Query Language" and "Metron Transformation
> > > Language."
> > >
> > > I felt a nagging feeling at the time that I should just expand the
> query
> > > language, but I convinced myself that it would require too much
> testing
> > and
> > > it would be a change that was too broad in scope. It took 3 months
> for me
> > > to get around to unifying those languages and if we had more
> people using
> > > it, it would have been an absolute nightmare.
> > >
> > > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella 
> > wrote:
> > >
> > >> Yeah, I +1 the notion of thorough automated tests.
> > >>
> > >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley 
> wrote:
> > >>
> > >>> Hard to mark diffs in text-only mode :-)  My proposed change is:
> > >>>
> > >>> >> All merged patches will be reviewed with the expectation that
> > >>> thorough automated tests shall be provided and are consistent
> with …
> > >>>
> > >>>
> > >>>  ^^
> > >>> Added word “thorough” and changed “exist” to “shall be provided”.
> > >>> Thanks,
> > >>> --Matt
> > >>>
> > >>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
> > >>>
> > >>> Hi Matt, thats already in there. See last bullet point of 2.6
> > >>>
> > >>> 20.12.2016, 14:14, "Matt Foley" :
> > >>> > If we aren't going to require 100% test coverage for new
> code,
> > >>> then we should at least say "thorough" automated tests, in the
> last
> > bullet
> > >>> of 2.6. And it should be a mandate not an assumption:
> > >>> >
> > >>> > All merged patches will be reviewed with the expectation
> that
> > >>> thorough automated tests shall be provided and are consistent
> with
> > project
> > >>> testing methodology and practices, and cover the appropriate
> cases (
> > see
> > >>> reviewers guide )
> > >>> >
> > >>> > IMO,
> > >>> > --Matt
> > >>> >
> > >>> > On 12/20/16, 12:51 PM, "James Sirota" 

Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Matt Foley
+1 on Casey’s first edit.  However, wrt the second, can we please not require 
vagrant?  Any of our single-node test deployments, including vagrant, ansible, 
mpack, or (soon :-) docker, should be acceptable.

Thanks,
--Matt (who can’t run vagrant workably on the systems available to me)


On 12/21/16, 8:52 AM, "Michael Miklavcic"  wrote:

Agreed on Casey's addition to 2.5. What do you think about saying the pla
should be stated on the PR, since that will be replicated to Jira
automatically?

On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella  wrote:

> Oh, one more, I propose the following addition to 2.5:
> >
> > JIRAs will have a description of how to exercise the functionality in a
> > step-by-step manner on a Quickdev vagrant instance to aid review and
> > validation.
>
>
> When Mike, Otto and I moved the system to the current version of Storm, we
> needed a broader smoke test than just running data through that exercised 
a
> variety of the features. We pulled those smoke tests from the various
> discussions in the JIRAs.
>
>
>
> On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella  wrote:
>
> > We have been having a lively discussion on METRON-590 (see
> > https://github.com/apache/incubator-metron/pull/395) around creating
> > multiple abstractions to do the same (or very nearly the same) thing.
> >
> > I'd like to propose an addition to section 2.3 which reads:
> >
> >> Contributions which provide abstractions which are either very similar
> to
> >> or a subset of existing abstractions should use and extend existing
> >> abstractions rather than provide competing abstractions unless
> engineering
> >> exigencies (e.g. performance ) make such an operation impossible 
without
> >> compromising core functionality of the platform.
> >
> >
> > I'd like to suggest the following anecdote from the early years of the
> > codebase to justify the above:
> >
> > Stellar started as a predicate language only for threat triage rules. As
> > such, when the task of creating Field Transformations came to me, I
> needed
> > something like Stellar except I needed it to return arbitrary objects,
> > rather than just booleans. In my infinite wisdom, I chose to fork the
> > language, create a second, more specific DSL for field transformations,
> > thereby creating "Metron Query Language" and "Metron Transformation
> > Language."
> >
> > I felt a nagging feeling at the time that I should just expand the query
> > language, but I convinced myself that it would require too much testing
> and
> > it would be a change that was too broad in scope. It took 3 months for 
me
> > to get around to unifying those languages and if we had more people 
using
> > it, it would have been an absolute nightmare.
> >
> > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella 
> wrote:
> >
> >> Yeah, I +1 the notion of thorough automated tests.
> >>
> >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley  wrote:
> >>
> >>> Hard to mark diffs in text-only mode :-)  My proposed change is:
> >>>
> >>> >> All merged patches will be reviewed with the expectation that
> >>> thorough automated tests shall be provided and are consistent with …
> >>>
> >>>
> >>>  ^^
> >>> Added word “thorough” and changed “exist” to “shall be provided”.
> >>> Thanks,
> >>> --Matt
> >>>
> >>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
> >>>
> >>> Hi Matt, thats already in there. See last bullet point of 2.6
> >>>
> >>> 20.12.2016, 14:14, "Matt Foley" :
> >>> > If we aren't going to require 100% test coverage for new code,
> >>> then we should at least say "thorough" automated tests, in the last
> bullet
> >>> of 2.6. And it should be a mandate not an assumption:
> >>> >
> >>> > All merged patches will be reviewed with the expectation that
> >>> thorough automated tests shall be provided and are consistent with
> project
> >>> testing methodology and practices, and cover the appropriate cases (
> see
> >>> reviewers guide )
> >>> >
> >>> > IMO,
> >>> > --Matt
> >>> >
> >>> > On 12/20/16, 12:51 PM, "James Sirota" 
> wrote:
> >>> >
> >>> > Good feedback. Here is the next iteration that accounts for
> >>> your suggestions:
> >>> > https://cwiki.apache.org/confluence/pages/viewpage.action?p
> >>> ageId=61332235
> >>> >
> >>> > 1. How To Contribute
> >>> > We are always very happy to have 

[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
Thanks, @merrimanr for expressing that better than I did.
@nickwallen , one of the principles of reliable refactoring[1] is to 
methodically do only and precisely the changes needed to achieve each design 
goal, so that implementors and reviewers may reason about the correctness of 
the changes, and avoid introducing new bugs or side effects.  

Thus, by starting from a refactoring of ConfiguredBolt, I would expect to 
see many lines in ZkConfigurationManager that look exactly like the lines in 
ConfiguredBolt, in the same order and organization, with only necessary changes 
to accomodate their new packaging.  That should be the starting point.  Then 
new features can be added, such as deserialized caching and perhaps utility 
APIs to assist the sorts of things ConfiguredProfilerBolt does as a client of 
the configuration manager.

[1] Martin Fowler and Kent Beck, "Refactoring: Improving the Design of 
Existing Code" 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #395: METRON-590 Enable Use of Event Time in P...

2016-12-21 Thread nickwallen
Github user nickwallen closed the pull request at:

https://github.com/apache/incubator-metron/pull/395


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread merrimanr
Github user merrimanr commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
I agree with @mattf-horton.  ConfiguredBolt should be improved/refactored 
instead of completely starting over.  Keep in mind this was developed over many 
iterations with input from different people and contains a lot of lessons 
learned and improvements gained from experience.  If we start over we're likely 
to make the same mistakes all over again, especially if this task is taken on 
by someone that was not heavily involved in developing ConfiguredBolt. 

What are the benefits of starting over?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
I don't understand what your proposal is @mattf-horton . Please clarify


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
@nickwallen and @cestella, I'm fine with this conclusion, but from a design 
perspective, the ZkConfigurationManager effort should not be viewed as bringing 
the current file to parity with ConfiguredBolt.  Rather, the starting point 
should be a refactoring of ConfiguredBolt.  ConfiguredBolt is only 60 lines of 
code that does this thing quite well, with real-time event capture.  You want 
to add caching of the deserialized configs as well as the raw ZK info, and 
that's good.  But that can be added on, it is not clear why starting from 
scratch is required.  Clearly deriving from ConfiguredBolt will also make it 
way easier to code review and give confidence in the refactoring.

If you feel it is really necessary to start from scratch, please do a brief 
design document stating why.  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Michael Miklavcic
Agreed on Casey's addition to 2.5. What do you think about saying the plan
should be stated on the PR, since that will be replicated to Jira
automatically?

On Wed, Dec 21, 2016 at 7:49 AM, Casey Stella  wrote:

> Oh, one more, I propose the following addition to 2.5:
> >
> > JIRAs will have a description of how to exercise the functionality in a
> > step-by-step manner on a Quickdev vagrant instance to aid review and
> > validation.
>
>
> When Mike, Otto and I moved the system to the current version of Storm, we
> needed a broader smoke test than just running data through that exercised a
> variety of the features. We pulled those smoke tests from the various
> discussions in the JIRAs.
>
>
>
> On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella  wrote:
>
> > We have been having a lively discussion on METRON-590 (see
> > https://github.com/apache/incubator-metron/pull/395) around creating
> > multiple abstractions to do the same (or very nearly the same) thing.
> >
> > I'd like to propose an addition to section 2.3 which reads:
> >
> >> Contributions which provide abstractions which are either very similar
> to
> >> or a subset of existing abstractions should use and extend existing
> >> abstractions rather than provide competing abstractions unless
> engineering
> >> exigencies (e.g. performance ) make such an operation impossible without
> >> compromising core functionality of the platform.
> >
> >
> > I'd like to suggest the following anecdote from the early years of the
> > codebase to justify the above:
> >
> > Stellar started as a predicate language only for threat triage rules. As
> > such, when the task of creating Field Transformations came to me, I
> needed
> > something like Stellar except I needed it to return arbitrary objects,
> > rather than just booleans. In my infinite wisdom, I chose to fork the
> > language, create a second, more specific DSL for field transformations,
> > thereby creating "Metron Query Language" and "Metron Transformation
> > Language."
> >
> > I felt a nagging feeling at the time that I should just expand the query
> > language, but I convinced myself that it would require too much testing
> and
> > it would be a change that was too broad in scope. It took 3 months for me
> > to get around to unifying those languages and if we had more people using
> > it, it would have been an absolute nightmare.
> >
> > On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella 
> wrote:
> >
> >> Yeah, I +1 the notion of thorough automated tests.
> >>
> >> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley  wrote:
> >>
> >>> Hard to mark diffs in text-only mode :-)  My proposed change is:
> >>>
> >>> >> All merged patches will be reviewed with the expectation that
> >>> thorough automated tests shall be provided and are consistent with …
> >>>
> >>>
> >>>  ^^
> >>> Added word “thorough” and changed “exist” to “shall be provided”.
> >>> Thanks,
> >>> --Matt
> >>>
> >>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
> >>>
> >>> Hi Matt, thats already in there. See last bullet point of 2.6
> >>>
> >>> 20.12.2016, 14:14, "Matt Foley" :
> >>> > If we aren't going to require 100% test coverage for new code,
> >>> then we should at least say "thorough" automated tests, in the last
> bullet
> >>> of 2.6. And it should be a mandate not an assumption:
> >>> >
> >>> > All merged patches will be reviewed with the expectation that
> >>> thorough automated tests shall be provided and are consistent with
> project
> >>> testing methodology and practices, and cover the appropriate cases (
> see
> >>> reviewers guide )
> >>> >
> >>> > IMO,
> >>> > --Matt
> >>> >
> >>> > On 12/20/16, 12:51 PM, "James Sirota" 
> wrote:
> >>> >
> >>> > Good feedback. Here is the next iteration that accounts for
> >>> your suggestions:
> >>> > https://cwiki.apache.org/confluence/pages/viewpage.action?p
> >>> ageId=61332235
> >>> >
> >>> > 1. How To Contribute
> >>> > We are always very happy to have contributions, whether for
> >>> trivial cleanups, little additions or big new features.
> >>> > If you don't know Java or Scala you can still contribute to
> >>> the project. We strongly value documentation and gladly accept
> improvements
> >>> to the documentation.
> >>> > 1.1 Contributing A Code Change
> >>> > To submit a change for inclusion, please do the following:
> >>> > If there is not already a JIRA associated with your pull
> >>> request, create it, assign it to yourself, and start progress
> >>> > If there is a JIRA already created for your change then
> assign
> >>> it to yourself and start progress
> >>> > If you don't have access to JIRA or can't assign an issue to
> >>> yourself, please message 

Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Casey Stella
Oh, one more, I propose the following addition to 2.5:
>
> JIRAs will have a description of how to exercise the functionality in a
> step-by-step manner on a Quickdev vagrant instance to aid review and
> validation.


When Mike, Otto and I moved the system to the current version of Storm, we
needed a broader smoke test than just running data through that exercised a
variety of the features. We pulled those smoke tests from the various
discussions in the JIRAs.



On Wed, Dec 21, 2016 at 9:38 AM, Casey Stella  wrote:

> We have been having a lively discussion on METRON-590 (see
> https://github.com/apache/incubator-metron/pull/395) around creating
> multiple abstractions to do the same (or very nearly the same) thing.
>
> I'd like to propose an addition to section 2.3 which reads:
>
>> Contributions which provide abstractions which are either very similar to
>> or a subset of existing abstractions should use and extend existing
>> abstractions rather than provide competing abstractions unless engineering
>> exigencies (e.g. performance ) make such an operation impossible without
>> compromising core functionality of the platform.
>
>
> I'd like to suggest the following anecdote from the early years of the
> codebase to justify the above:
>
> Stellar started as a predicate language only for threat triage rules. As
> such, when the task of creating Field Transformations came to me, I needed
> something like Stellar except I needed it to return arbitrary objects,
> rather than just booleans. In my infinite wisdom, I chose to fork the
> language, create a second, more specific DSL for field transformations,
> thereby creating "Metron Query Language" and "Metron Transformation
> Language."
>
> I felt a nagging feeling at the time that I should just expand the query
> language, but I convinced myself that it would require too much testing and
> it would be a change that was too broad in scope. It took 3 months for me
> to get around to unifying those languages and if we had more people using
> it, it would have been an absolute nightmare.
>
> On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella  wrote:
>
>> Yeah, I +1 the notion of thorough automated tests.
>>
>> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley  wrote:
>>
>>> Hard to mark diffs in text-only mode :-)  My proposed change is:
>>>
>>> >> All merged patches will be reviewed with the expectation that
>>> thorough automated tests shall be provided and are consistent with …
>>>
>>>
>>>  ^^
>>> Added word “thorough” and changed “exist” to “shall be provided”.
>>> Thanks,
>>> --Matt
>>>
>>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
>>>
>>> Hi Matt, thats already in there. See last bullet point of 2.6
>>>
>>> 20.12.2016, 14:14, "Matt Foley" :
>>> > If we aren't going to require 100% test coverage for new code,
>>> then we should at least say "thorough" automated tests, in the last bullet
>>> of 2.6. And it should be a mandate not an assumption:
>>> >
>>> > All merged patches will be reviewed with the expectation that
>>> thorough automated tests shall be provided and are consistent with project
>>> testing methodology and practices, and cover the appropriate cases ( see
>>> reviewers guide )
>>> >
>>> > IMO,
>>> > --Matt
>>> >
>>> > On 12/20/16, 12:51 PM, "James Sirota"  wrote:
>>> >
>>> > Good feedback. Here is the next iteration that accounts for
>>> your suggestions:
>>> > https://cwiki.apache.org/confluence/pages/viewpage.action?p
>>> ageId=61332235
>>> >
>>> > 1. How To Contribute
>>> > We are always very happy to have contributions, whether for
>>> trivial cleanups, little additions or big new features.
>>> > If you don't know Java or Scala you can still contribute to
>>> the project. We strongly value documentation and gladly accept improvements
>>> to the documentation.
>>> > 1.1 Contributing A Code Change
>>> > To submit a change for inclusion, please do the following:
>>> > If there is not already a JIRA associated with your pull
>>> request, create it, assign it to yourself, and start progress
>>> > If there is a JIRA already created for your change then assign
>>> it to yourself and start progress
>>> > If you don't have access to JIRA or can't assign an issue to
>>> yourself, please message dev@metron.incubator.apache.org and someone
>>> will either give you permission or assign a JIRA to you
>>> > If you are introducing a completely new feature or API it is a
>>> good idea to start a discussion and get consensus on the basic design
>>> first. Larger changes should be discussed on the dev boards before
>>> submission.
>>> > New features and significant bug fixes should be documented in
>>> the JIRA and appropriate architecture diagrams should 

[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
@nickwallen Your suggestion is fine by me.  I hate to put the onus on you 
like this, but I think I like your suggestion best of all because it will 
result in smaller and more targeted review of your work.  Thanks for bearing up 
like this; your contribution is really great and we very much want it 
incorporated. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


Re: [DISCUSS] Coding Guidelines

2016-12-21 Thread Casey Stella
We have been having a lively discussion on METRON-590 (see
https://github.com/apache/incubator-metron/pull/395) around creating
multiple abstractions to do the same (or very nearly the same) thing.

I'd like to propose an addition to section 2.3 which reads:

> Contributions which provide abstractions which are either very similar to
> or a subset of existing abstractions should use and extend existing
> abstractions rather than provide competing abstractions unless engineering
> exigencies (e.g. performance ) make such an operation impossible without
> compromising core functionality of the platform.


I'd like to suggest the following anecdote from the early years of the
codebase to justify the above:

Stellar started as a predicate language only for threat triage rules. As
such, when the task of creating Field Transformations came to me, I needed
something like Stellar except I needed it to return arbitrary objects,
rather than just booleans. In my infinite wisdom, I chose to fork the
language, create a second, more specific DSL for field transformations,
thereby creating "Metron Query Language" and "Metron Transformation
Language."

I felt a nagging feeling at the time that I should just expand the query
language, but I convinced myself that it would require too much testing and
it would be a change that was too broad in scope. It took 3 months for me
to get around to unifying those languages and if we had more people using
it, it would have been an absolute nightmare.

On Wed, Dec 21, 2016 at 9:31 AM, Casey Stella  wrote:

> Yeah, I +1 the notion of thorough automated tests.
>
> On Tue, Dec 20, 2016 at 4:36 PM, Matt Foley  wrote:
>
>> Hard to mark diffs in text-only mode :-)  My proposed change is:
>>
>> >> All merged patches will be reviewed with the expectation that thorough
>> automated tests shall be provided and are consistent with …
>>
>>  
>>  ^^
>> Added word “thorough” and changed “exist” to “shall be provided”.
>> Thanks,
>> --Matt
>>
>> On 12/20/16, 1:22 PM, "James Sirota"  wrote:
>>
>> Hi Matt, thats already in there. See last bullet point of 2.6
>>
>> 20.12.2016, 14:14, "Matt Foley" :
>> > If we aren't going to require 100% test coverage for new code, then
>> we should at least say "thorough" automated tests, in the last bullet of
>> 2.6. And it should be a mandate not an assumption:
>> >
>> > All merged patches will be reviewed with the expectation that
>> thorough automated tests shall be provided and are consistent with project
>> testing methodology and practices, and cover the appropriate cases ( see
>> reviewers guide )
>> >
>> > IMO,
>> > --Matt
>> >
>> > On 12/20/16, 12:51 PM, "James Sirota"  wrote:
>> >
>> > Good feedback. Here is the next iteration that accounts for
>> your suggestions:
>> > https://cwiki.apache.org/confluence/pages/viewpage.action?
>> pageId=61332235
>> >
>> > 1. How To Contribute
>> > We are always very happy to have contributions, whether for
>> trivial cleanups, little additions or big new features.
>> > If you don't know Java or Scala you can still contribute to the
>> project. We strongly value documentation and gladly accept improvements to
>> the documentation.
>> > 1.1 Contributing A Code Change
>> > To submit a change for inclusion, please do the following:
>> > If there is not already a JIRA associated with your pull
>> request, create it, assign it to yourself, and start progress
>> > If there is a JIRA already created for your change then assign
>> it to yourself and start progress
>> > If you don't have access to JIRA or can't assign an issue to
>> yourself, please message dev@metron.incubator.apache.org and someone
>> will either give you permission or assign a JIRA to you
>> > If you are introducing a completely new feature or API it is a
>> good idea to start a discussion and get consensus on the basic design
>> first. Larger changes should be discussed on the dev boards before
>> submission.
>> > New features and significant bug fixes should be documented in
>> the JIRA and appropriate architecture diagrams should be attached. Major
>> features may require a vote.
>> > Note that if the change is related to user-facing protocols /
>> interface / configs, etc, you need to make the corresponding change on the
>> documentation as well.
>> > Craft a pull request following the guidelines in Section 2 of
>> this document
>> > Pull requests should be small to facilitate easier review.
>> Studies have shown that review quality falls off as patch size grows.
>> Sometimes this will result in many small PRs to land a single large feature.
>> > People will review and comment on your pull request. It is our
>> job to follow up on pull 

[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread nickwallen
Github user nickwallen commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
As to @mattf-horton suggestion, I had looked at the possibility of not 
using the BaseWindowedBolt and implementing that myself.  But that seems like 
worse technical debt to me.  I'd be worried about how Storm evolves the 
functionality in future versions, especially since the functionality is 
relatively new in Storm.  I'm sure you have the same concerns, you're just 
trying to help find some compromise solution here.

@cestella I see what you're going for, but don't see the logic in 
reimplementing what I already have working in `ZkConfigurationManager`.  What 
if I moved `ZkConfigurationManager` and the related classes to the 
`metron-profiler-common` project.  This would ensure that it could only be used 
exclusively by the Profiler.  Then as a follow-on PR, I will reach feature 
parity with ConfiguredBolt, move `ZkConfigurationManager` back to 
`metron-common`, update `ConfiguredBolt` to use `ZkConfigurationManager`, and 
all else that is needed there.

I would suggest another approach as an alternative.  This satisfies the 
concerns for multiple implementations, and results in smaller, more concise PRs.
1. Retract this PR
2. Enhance ZkConfigurationManager to reach feature parity with 
ConfiguredBolt, update `ConfiguredBolt` to use `ZkConfigurationManager`, then 
submit that work as a separate PR. 
3. Create a second PR for the new client-side Profiler functions as this 
has had its own lengthy discussion back-and-forth.
4. Create a third PR for the event time processing additions in the 
Profiler.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
I also wanted to mention one of the reasons I am particularly sensitive to 
having multiple components that do *very nearly* the same thing.  I made this 
mistake early on in the project (and will probably make it again, because..you 
know, I don't learn ;).  

Stellar started as a predicate language only for threat triage rules.  As 
such, when the task of creating Field Transformations came to me, I needed 
something like Stellar except I needed it to return arbitrary objects, rather 
than just booleans.  In my infinite wisdom, I chose to fork the language, 
create a second, more specific DSL for field transformations, thereby creating 
"Metron Query Language" and "Metron Transformation Language."  

I felt a nagging feeling at the time that I should just expand the query 
language, but I convinced myself that it would require too much testing and it 
would be a change that was too broad in scope.  It took 3 months for me to get 
around to unifying those languages and if we had more people using it, it would 
have been an absolute nightmare.  This may be a "once bitten, twice shy" thing, 
but I think it's a good policy in general.  Pardon the interlude; just wanted 
to give some context.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread cestella
Github user cestella commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
@mattf-horton @mmiklavc  First off, thanks for the perspective, both of 
you.  While I think we should investigate your suggestion, @mattf-horton , it 
appears on the face of it quite dependent upon implementation decisions and 
details inside of Storm.  That immediately gives me pause.  That being said, 
I'll repeat that I think we should consider it strongly.

I do agree with @mattf-horton that pulling out the `ZkConfigurationManager` 
here would be substantially more work if you consider the impact to our tests 
and the increased amount of testing that we would have to add to this PR.  I 
think, as @mmiklavc suggested, it's a separate PR.  I would prefer to get this 
really good work by @nickwallen in, so maybe I can offer another compromise 
position to go with @mattf-horton 's one earlier.

I'd like to propose another compromise.  The main concern that I have is 
leaving an second abstraction in place for managing configuration from 
zookeeper.  Given that, what do you think of the following:
* Remove the `ConfigurationManager` abstraction from this PR
* Since it is used only in one class now, directly use the 
`org.apache.curator.framework.recipes.cache.NodeCache` in `ProfilerSplitterBolt`
* Create a follow-on JIRA and make a comment with a TODO suggesting we 
replace this with the appropriate abstraction (referencing the JIRA).

This would remove the possibility of a fork and be substantially less 
work/impact.

Thoughts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron issue #395: METRON-590 Enable Use of Event Time in Profiler

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on the issue:

https://github.com/apache/incubator-metron/pull/395
  
Regarding ZkConfigurationManager and the above discussion thread:
This is a complex situation, but my best advice is as follows:

1.  It is definitely worthwhile to refactor a CuratorZkConfigManager class 
out of ConfiguredBolt.  In fact, when we do, we really should propose the Storm 
team adopt it; it would be tremendously valuable for lots of Storm 
applications.  HOWEVER, this is a significant task in its own right and more 
than a couple days' work.  I believe there is a much simpler solution for the 
short term, that either avoids or mostly avoids the split implementation 
problem (depending on a Storm detail that I wasn't able to determine), so I 
don't recommend we make Nick do it as part of this task.

2. I agree with @merrimanr and @cestella that we should not lose the 
established functionality of ConfiguredBolt for managing 
ProfilerConfigurations, nor create a new config management and caching 
mechanism, even with good intentions about putting it all back together in the 
future.

3. After analyzing what goes into the class hierarchies, I see that:
a) The way BaseWindowedBolt manages its windowConfiguration is 
qualitatively different from the Curated Zk based management used by 
ConfiguredBolt.  It would be hard to merge them, and that's what is causing the 
current problem.
b) But it isn't necessary, because the config management built into 
BaseWindowedBolt is just a distraction.

What's necessary to use Storm Windowing is that:
- the Bolt implements IWindowedBolt interface, 
- a certain set of 7 parameters (defined by example in BaseWindowedBolt) is 
presented to Storm when it queries the Bolt's getComponentConfiguration() 
method,
- and the Storm topology correctly identifies the Bolt as being a Windowed 
Bolt.

The only question in my mind is the last bullet.  Does Storm identify the 
Bolt as Windowed just because it presents the windowing parameters in 
getComponentConfiguration() ?  Or does it do class reflection on the Bolt?  If 
it uses reflection, is it looking for IWindowedBolt or for BaseWindowedBolt ?

It is likely that one of the first two is sufficient to identify the Bolt 
as Windowed.  If so, there's a very easy solution to all this.  Define:
```
public abstract class ConfiguredWindowedBolt 
extends ConfiguredBolt implements IWindowedBolt
{with only a few lines of content needed}
and
public abstract class ConfiguredProfilerBolt 
extends ConfiguredWindowedBolt 
```
This compiles on the master codeline; I can share my straw man if desired.
Then change `getComponentConfiguration()` in ProfileBuilderBolt, to return 
the 7 windowing-related config parameters instead of 
TOPOLOGY_TICK_TUPLE_FREQ_SECS.  Other than changes to add the windowing params 
to the ProfilerConfigurations on a per-bolt basis, the current config 
management code in ConfiguredProfilerBolt and ConfiguredBolt can be used just 
as is.

If Storm actually requires the Bolt to be of type BaseWindowedBolt, then it 
is a little uglier, but still doable:  Copy the methods of ConfiguredBolt into 
a new class that extends BaseWindowedBolt, simply overriding any conflicting 
methods already in BaseWindowedBolt.  We have to duplicate the code, but it can 
be copied verbatim, thereby giving much higher reliability than new code.
Again, there's no need to try to use the configuration management code 
built into BaseWindowedBolt; only what gets returned in 
`getComponentConfiguration()` is important.

I don't expect either of these solutions to give great joy, but it is 
workable and can be done reliably.  We don't need to refactor ConfiguredBolt, 
and its curated config management can be used whole.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #395: METRON-590 Enable Use of Event Time in P...

2016-12-21 Thread mattf-horton
Github user mattf-horton commented on a diff in the pull request:

https://github.com/apache/incubator-metron/pull/395#discussion_r93426555
  
--- Diff: 
metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/manager/ZkConfigurationManager.java
 ---
@@ -0,0 +1,129 @@
+/*
+ *
+ *  Licensed to the Apache Software Foundation (ASF) under one
+ *  or more contributor license agreements.  See the NOTICE file
+ *  distributed with this work for additional information
+ *  regarding copyright ownership.  The ASF licenses this file
+ *  to you under the Apache License, Version 2.0 (the
+ *  "License"); you may not use this file except in compliance
+ *  with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ *  Unless required by applicable law or agreed to in writing, software
+ *  distributed under the License is distributed on an "AS IS" BASIS,
+ *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 
implied.
+ *  See the License for the specific language governing permissions and
+ *  limitations under the License.
+ *
+ */
+
+package org.apache.metron.common.configuration.manager;
+
+import org.apache.curator.framework.CuratorFramework;
+import org.apache.curator.framework.recipes.cache.NodeCache;
+import org.apache.curator.utils.CloseableUtils;
+import org.apache.metron.common.utils.JSONUtils;
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Optional;
+
+import static org.apache.commons.lang.ArrayUtils.isNotEmpty;
+
+/**
+ * Responsible for managing configuration values that are created, 
persisted, and updated
+ * in Zookeeper.
+ */
+public class ZkConfigurationManager implements ConfigurationManager {
--- End diff --

This thread has gotten unmanageably long.  I'm going to start a new thread 
at the bottom of the page.  Please forgive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-metron pull request #400: METRON-636: Capture memory and cpu detai...

2016-12-21 Thread anandsubbu
GitHub user anandsubbu opened a pull request:

https://github.com/apache/incubator-metron/pull/400

METRON-636: Capture memory and cpu details as a part of platform-info…

… script

Added top commands for Linux and Mac to be captured as a part of the 
platform-info.sh script.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/anandsubbu/incubator-metron METRON-636

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-metron/pull/400.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #400


commit c5b9168740308a65f74229f2fde650138a675173
Author: Anand Subramanian 
Date:   2016-12-21T08:21:51Z

METRON-636: Capture memory and cpu details as a part of platform-info script




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---