[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-12-09 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r240064005
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
--- End diff --

I have added this explanation to the README. Thanks for the suggestion.
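
For reference, the `convertCamelCaseToUnderScore` behaviour described in the README diff above could be sketched roughly as follows. This is only an illustration and not the parser's actual code: the class name `CamelCaseSketch` and the method `toUnderscore` are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the pull request.
class CamelCaseSketch {

  // A lower-case letter or digit immediately followed by an upper-case letter.
  private static final Pattern BOUNDARY = Pattern.compile("([a-z0-9])([A-Z])");

  // Inserts an underscore at each camel-case boundary, then lower-cases the result.
  static String toUnderscore(String name) {
    return BOUNDARY.matcher(name).replaceAll("$1_$2").toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(toUnderscore("ipSrcAddr")); // prints "ip_src_addr"
    System.out.println(toUnderscore("ipDstAddr")); // prints "ip_dst_addr"
    System.out.println(toUnderscore("ipSrcPort")); // prints "ip_src_port"
  }
}
```

This lets the regular expressions use camel-case group names, which Java accepts, while the emitted field names still use underscores.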


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-12-09 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r240063655
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertTrue;
+
+public class RegularExpressionsParserTest {
+
+  private RegularExpressionsParser regularExpressionsParser;
+  private JSONObject parserConfig;
+
+  @Before
+  public void setUp() throws Exception {
+regularExpressionsParser = new RegularExpressionsParser();
+  }
+
+  @Test
+  public void testSSHDParse() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
--- End diff --

I have added the timestamp field to the parser and have also switched to more 
targeted configurations using @Multiline.

I will try to add integration tests as well.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-12-06 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r239664781
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
+  "fields": [
+{
+  "recordType": "kernel",
+  "regex": ".*((<=\\]|\\w\\:).*?(?=$))"
+},
+{
+  "recordType": "syslog",
+  "regex": 
".*((<=PID\\s=\\s).*?(?=\\sLine)).*((<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))
(.*?(?=\")).*((<=\").*?(?=$))"
+}
+  ]
+  ```
+  **Note**: messageHeaderRegex and regex (withing fields) could be 
specified as lists also e.g.
+  ```json
+  "messageHeaderRegex": [
+  "regular expression 1",
+  "regular expression 2"
+  ]
+  ```
+  Where **regular expression 1** are valid regular expressions and may 
have named
+  groups, which would be extracted into fields. This list will be 
evaluated in order until a
+  matching regular expression is found.
+  
+  **recordTypeRegex** can be a more advanced regular expression 
containing named goups. For example
--- End diff --

Thanks. I will update the documentation.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237718325
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertTrue;
+
+public class RegularExpressionsParserTest {
+
+  private RegularExpressionsParser regularExpressionsParser;
+  private JSONObject parserConfig;
+
+  @Before
+  public void setUp() throws Exception {
+regularExpressionsParser = new RegularExpressionsParser();
+  }
+
+  @Test
+  public void testSSHDParse() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
--- End diff --

It could be failing because this parser does not add "timestamp" to the 
parsed JSON. In our use case we add the timestamp using Stellar. I will update the 
parser to add the current system time as a default timestamp.
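
A minimal sketch of that default, assuming the parsed message is held in a plain `Map` and the field is called `timestamp` as in the README. The class `TimestampDefaultSketch` and the method `withDefaultTimestamp` are hypothetical names, not the parser's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not taken from the pull request.
class TimestampDefaultSketch {

  // Returns a copy of the parsed fields; "timestamp" is set to `now`
  // only when the parser configuration did not already extract one.
  static Map<String, Object> withDefaultTimestamp(Map<String, Object> parsed, long now) {
    Map<String, Object> out = new HashMap<>(parsed);
    out.putIfAbsent("timestamp", now);
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> parsed = new HashMap<>();
    parsed.put("ip_src_addr", "22.22.22.22");
    System.out.println(withDefaultTimestamp(parsed, System.currentTimeMillis()));
  }
}
```

A timestamp that was already extracted by a regular expression or added by Stellar would be left untouched.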


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237716600
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
+  "fields": [
+{
+  "recordType": "kernel",
+  "regex": ".*((<=\\]|\\w\\:).*?(?=$))"
+},
+{
+  "recordType": "syslog",
+  "regex": 
".*((<=PID\\s=\\s).*?(?=\\sLine)).*((<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))
(.*?(?=\")).*((<=\").*?(?=$))"
--- End diff --

I would say that not every syslog message is expected to contain these fields. But 
from **this form** of syslog message, we expect to extract these fields 
(processid, fileName, filePath and eventInfo).

This configuration has been extracted from our use case. Our security 
experts found this form of syslog message to be important from a security 
perspective. There could be other forms of syslog messages that we don't 
care about.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237715657
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
--- End diff --

1. Yes, messageHeaderRegex is run on all the messages. 
2. Yes, all the messages are expected to contain three fields in this case.
So messageHeaderRegex is a sort of highest common factor (HCF) of all messages: 
it captures the part that every message shares.
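
As a rough illustration of a header regex applied to every message, here is a sketch run against the sample message from the unit test. The pattern is a simplified stand-in written for this example and not the one from the README. The group names `syslogpriority`, `timestampDeviceOriginal` and `deviceName` are assumptions, as is the class name `MessageHeaderSketch`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the parser's actual implementation.
class MessageHeaderSketch {

  // Assumed group names; Java group names may contain only letters and digits.
  static final String[] GROUPS = {"syslogpriority", "timestampDeviceOriginal", "deviceName"};

  // Simplified header pattern: <priority>, device timestamp, device name.
  static final Pattern HEADER = Pattern.compile(
      "^<(?<syslogpriority>\\d{1,4})>"
          + "(?<timestampDeviceOriginal>[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2})\\s"
          + "(?<deviceName>\\S+)\\s");

  // Extracts the header fields shared by all messages; empty map if no match.
  static Map<String, String> extractHeader(String message) {
    Map<String, String> fields = new LinkedHashMap<>();
    Matcher m = HEADER.matcher(message);
    if (m.find()) {
      for (String group : GROUPS) {
        fields.put(group, m.group(group));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: "
        + "Accepted publickey for prod from 22.22.22.22 port 5 ssh2";
    System.out.println(extractHeader(message));
  }
}
```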


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237715210
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
+  "fields": [
+{
+  "recordType": "kernel",
+  "regex": ".*((<=\\]|\\w\\:).*?(?=$))"
+},
+{
+  "recordType": "syslog",
+  "regex": 
".*((<=PID\\s=\\s).*?(?=\\sLine)).*((<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))
(.*?(?=\")).*((<=\").*?(?=$))"
+}
+  ]
+  ```
+  **Note**: messageHeaderRegex and regex (withing fields) could be 
specified as lists also e.g.
+  ```json
+  "messageHeaderRegex": [
+  "regular expression 1",
+  "regular expression 2"
+  ]
+  ```
+  Where **regular expression 1** are valid regular expressions and may 
have named
+  groups, which would be extracted into fields. This list will be 
evaluated in order until a
+  matching regular expression is found.
+  
+  **recordTypeRegex** can be a more advanced regular expression 
containing named goups. For example
--- End diff --

Although having a named group in recordType is completely optional, you 
may still want to use one for the following reasons:

1. Since the **recordType** regular expression is already being matched, and 
we are already paying the price of that match, we can extract certain fields 
as a by-product of it.
2. Most likely the recordType field is common across all the messages. 
Hence extracting it in the **recordType** (or **messageHeaderRegex**) 
would reduce the overall complexity of the regular expressions in the **regex** 
field.

Again, how you craft your parser configuration is a personal choice. 
These are just the options available to the user.
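
To illustrate the first point, here is a sketch in which the record-type match also yields a field. The shortened alternation and the group name `dstProcessName` are assumptions made for this example, and `RecordTypeSketch` is a hypothetical class rather than the parser's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the parser's actual implementation.
class RecordTypeSketch {

  // The named group wraps the whole record-type match, so identifying the
  // record type and extracting the field happen in a single regex match.
  static final Pattern RECORD_TYPE = Pattern.compile(
      "(?<dstProcessName>(?<=\\s)\\b(kernel|syslog|sshd)\\b(?=\\[|:))");

  // Returns the record type (also usable as a field value), or null if none matches.
  static String recordType(String message) {
    Matcher m = RECORD_TYPE.matcher(message);
    return m.find() ? m.group("dstProcessName") : null;
  }

  public static void main(String[] args) {
    String message = "<38>Jun 20 15:01:17 deviceName sshd[11672]: "
        + "Accepted publickey for prod from 22.22.22.22 port 5 ssh2";
    System.out.println(recordType(message)); // prints "sshd"
  }
}
```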


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237714079
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertTrue;
+
+public class RegularExpressionsParserTest {
+
+  private RegularExpressionsParser regularExpressionsParser;
+  private JSONObject parserConfig;
+
+  @Before
+  public void setUp() throws Exception {
+regularExpressionsParser = new RegularExpressionsParser();
+  }
+
+  @Test
+  public void testSSHDParse() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
+regularExpressionsParser.configure(parserConfig);
+JSONObject parsed = parse(message);
+// Expected
+Map expectedJson = new HashMap<>();
+expectedJson.put("device_name", "deviceName");
+expectedJson.put("dst_process_name", "sshd");
+expectedJson.put("dst_process_id", "11672");
+expectedJson.put("dst_user_id", "prod");
+expectedJson.put("ip_src_addr", "22.22.22.22");
+expectedJson.put("ip_src_port", "5");
+expectedJson.put("app_protocol", "ssh2");
+assertTrue(validate(expectedJson, parsed));
+
+  }
+
+  @Test
+  public void testNoMessageHeaderRegex() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsNoMessageHeaderParserConfig.json")
+.toString());
+regularExpressionsParser.configure(parserConfig);
+JSONObject parsed = parse(message);
+// Expected
+Map expectedJson = new HashMap<>();
+expectedJson.put("dst_process_name", "sshd");
+expectedJson.put("dst_process_id", "11672");
+expectedJson.put("dst_user_id", "prod");
+expectedJson.put("ip_src_addr", "22.22.22.22");
+expectedJson.put("ip_src_port", "5");
+expectedJson.put("app_protocol", "ssh2");
+assertTrue(validate(expectedJson, parsed));
--- End diff --

I personally found JUnit logging to be insufficient; I wanted more 
information in the logs. Also, expectedJson.put("ip_src_port", "5"); was 
more concise than its assertion counterpart.  

Another advantage of this method is that it reports all the failed 
scenarios in one run, whereas a failed JUnit assertion stops the test 
case then and there. 

Also, JUnit best practices recommend at most one assertion per test case. 
To follow that practice without the validate method, we would have to write 
one unit test per field, which does not feel right. Having the validate 
method lets us follow the JUnit best practices.

Do you still want me to remove the validate method?
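
The idea of reporting every failed field in one run could be sketched as below. This is not the `validate` method from the pull request: `ValidateSketch` and `mismatches` are hypothetical names, and the comparison is assumed to be a simple per-key equality check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch, not the helper from the pull request.
class ValidateSketch {

  // Compares every expected field and collects all mismatches,
  // instead of stopping at the first one as a chain of assertions would.
  static List<String> mismatches(Map<String, Object> expected, Map<String, Object> actual) {
    List<String> problems = new ArrayList<>();
    for (Map.Entry<String, Object> entry : expected.entrySet()) {
      Object got = actual.get(entry.getKey());
      if (!Objects.equals(entry.getValue(), got)) {
        problems.add(entry.getKey() + ": expected <" + entry.getValue()
            + "> but was <" + got + ">");
      }
    }
    return problems;
  }

  public static void main(String[] args) {
    Map<String, Object> expected = new HashMap<>();
    expected.put("ip_src_port", "5");
    expected.put("app_protocol", "ssh2");
    Map<String, Object> actual = new HashMap<>();
    actual.put("ip_src_port", "6");
    // Both fields are reported: one wrong value and one missing key.
    mismatches(expected, actual).forEach(System.out::println);
  }
}
```

A test could then make a single assertion that the returned list is empty and use the list itself as the failure message.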


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237708788
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertTrue;
+
+public class RegularExpressionsParserTest {
+
+  private RegularExpressionsParser regularExpressionsParser;
+  private JSONObject parserConfig;
+
+  @Before
+  public void setUp() throws Exception {
+regularExpressionsParser = new RegularExpressionsParser();
+  }
+
+  @Test
+  public void testSSHDParse() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
+regularExpressionsParser.configure(parserConfig);
+JSONObject parsed = parse(message);
+// Expected
+Map expectedJson = new HashMap<>();
+expectedJson.put("device_name", "deviceName");
+expectedJson.put("dst_process_name", "sshd");
+expectedJson.put("dst_process_id", "11672");
+expectedJson.put("dst_user_id", "prod");
+expectedJson.put("ip_src_addr", "22.22.22.22");
+expectedJson.put("ip_src_port", "5");
+expectedJson.put("app_protocol", "ssh2");
--- End diff --

Sure will do that.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237707637
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,152 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import static org.junit.Assert.assertTrue;
+
+public class RegularExpressionsParserTest {
+
+  private RegularExpressionsParser regularExpressionsParser;
+  private JSONObject parserConfig;
+
+  @Before
+  public void setUp() throws Exception {
+regularExpressionsParser = new RegularExpressionsParser();
+  }
+
+  @Test
+  public void testSSHDParse() throws Exception {
+String message =
+"<38>Jun 20 15:01:17 deviceName sshd[11672]: Accepted publickey 
for prod from 22.22.22.22 port 5 ssh2";
+
+parserConfig = getJsonConfig(
+
Paths.get("src/test/resources/config/RegularExpressionsParserConfig.json").toString());
--- End diff --

Actually, the parser did parse the message. If you look at the raw_message, it 
is actually the parsed message. Now, there is certainly something weird here. 
I am not sure why the REPL thinks that the parser failed, or why it is putting 
the successfully parsed message into the raw_message field. As the parser itself 
has no relation to the raw_message field, I think something is wrong with the REPL. 
Below is the parsed message extracted from the REPL output, so the REPL certainly 
got this output from the parser. The only way it could have got this output is 
if the parser returned successfully from the **parse** method.

```
{
"dst_process_id": "11672",
"dst_process_name": "sshd",
"source.type": "regex",
"device_name": "deviceName",
"original_string": "<38>Jun 20 15:01:17 deviceName sshd[11672]: 
Accepted publickey for prod from 22.22.22.22 port 5 ssh2",
"event_info": "Accepted publickey",
"ip_src_port": "5",
"dst_user_id": "prod",
"app_protocol": "ssh2",
"guid": "edaee82d-02fb-4ec9-9412-5912fa8d4a6f",
"syslogpriority": "38",
"timestamp_device_original": "Jun 20 15:01:17",
"ip_src_addr": "22.22.22.22"
}
```

Regarding changing the configuration to use @Multiline, I will do that.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237699908
  
--- Diff: metron-platform/metron-parsers/README.md ---
@@ -52,6 +52,62 @@ There are two general types types of parsers:
This is using the default value for `wrapEntityName` if that 
property is not set.
 * `wrapEntityName` : Sets the name to use when wrapping JSON using 
`wrapInEntityArray`.  The `jsonpQuery` should reference this name.
 * A field called `timestamp` is expected to exist and, if it does not, 
then current time is inserted.  
+  * Regular Expressions Parser
+  * `recordTypeRegex` : A regular expression to uniquely identify a 
record type.
+  * `messageHeaderRegex` : A regular expression used to extract fields 
from a message part which is common across all the messages.
+  * `convertCamelCaseToUnderScore` : If this property is set to true, 
this parser will automatically convert all the camel case property names to 
underscore seperated. 
+  For example, following convertions will automatically happen:
+
+  ```
+  ipSrcAddr -> ip_src_addr
+  ipDstAddr -> ip_dst_addr
+  ipSrcPort -> ip_src_port
+  ```
+  Note this property may be necessary, because java does not 
support underscores in the named group names. So in case your property naming 
conventions requires underscores in property names, use this property.
+  
+  * `fields` : A json list of maps contaning a record type to regular 
expression mapping.
+  
+  A complete configuration example would look like:
+  
+  ```json
+  "convertCamelCaseToUnderScore": true, 
+  "recordTypeRegex": "kernel|syslog",
+  "messageHeaderRegex": 
"((<=^)\\d{1,4}(?=>)).*?((<=>)[A-Za-z] 
{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?((<=\\s).*?(?=\\s))",
+  "fields": [
+{
+  "recordType": "kernel",
+  "regex": ".*((<=\\]|\\w\\:).*?(?=$))"
+},
+{
+  "recordType": "syslog",
+  "regex": 
".*((<=PID\\s=\\s).*?(?=\\sLine)).*((<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))
(.*?(?=\")).*((<=\").*?(?=$))"
+}
+  ]
+  ```
+  **Note**: messageHeaderRegex and regex (withing fields) could be 
specified as lists also e.g.
--- End diff --

The following is an example where regex is a list:
```
{
  "recordType": "STARTSAVECONFIG",
  "regex": [
    ".*(?(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?(?<=\\s\\d{1,7}\\s:\\s).*?(?=$)).*$",
    ".*(?(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?(?<=\\s:\\s).*?(?=$)).*$"
  ]
}
```
A list should be used when a particular record type has multiple forms. 

If a record type has only one form (for example, in the case of Cisco 
ASA), then there is no need for a list. The **regex** field can be given 
as a single string, since only one regular expression is required per **recordType**. 
For example:

```
{
"recordType": "APPFW APPFW_FIELDFORMAT",
 "regex": 
".*(?(?<=\\s).*?(?=\\s\\d{1,7}-\\w{1,10}-\\d{1,7})).*?(?(?<=\\s\\d{1,7}\\s:\\s{1,2}).*?(?=\\s)).*?(?(?<=\\s)\\d+(?=\\-)).*?(?(?<=\\-\\w{1,10}\\s).*?(?=\\s)).*?(?(?<=\\s).*?(?=\\s)).*?(?(?<=\\s).*?(?=\\s)).*?(?(?<=\\s).*?(?=\\s\\<)).*?(?(?<=\\<).*?(?=\\>)).*$"
}
```
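
The in-order evaluation of such a list could look roughly like the sketch below. The two patterns and their group names (`sessionId`, `userName`, `eventInfo`) are invented for illustration, and `RegexListSketch.firstMatch` is a hypothetical helper rather than the parser's actual method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the parser's actual implementation.
class RegexListSketch {

  // Most specific form first, most general form last.
  static final List<Pattern> PATTERNS = Arrays.asList(
      Pattern.compile("Session\\s(?<sessionId>\\d+)\\sof\\suser\\s(?<userName>\\w+)"),
      Pattern.compile("(?<eventInfo>.+)"));

  // Tries each pattern in order and returns the matcher of the first one that matches.
  static Optional<Matcher> firstMatch(List<Pattern> patterns, String message) {
    for (Pattern pattern : patterns) {
      Matcher m = pattern.matcher(message);
      if (m.find()) {
        return Optional.of(m);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    firstMatch(PATTERNS, "Started Session 12 of user root.")
        .ifPresent(m -> System.out.println(m.group("sessionId") + " " + m.group("userName")));
  }
}
```

Ordering the list from most to least specific matters here, because a general pattern placed first would match before the specific ones are ever tried.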


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-29 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r237698406
  
--- Diff: 
metron-platform/metron-parsers/src/test/resources/config/RegularExpressionsInvalidParserConfig.json
 ---
@@ -0,0 +1,208 @@
+{
+  "convertCamelCaseToUnderScore": true,
+  "messageHeaderRegex": 
"(?(?<=^<)\\d{1,4}(?=>)).*?(?(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?(?<=\\s).*?(?=\\s))",
+  "recordTypeRegex": 
"(?(?<=\\s)\\b(tch-replicant|audispd|syslog|ntpd|sendmail|pure-ftpd|usermod|useradd|anacron|unix_chkpwd|sudo|dovecot|postfix\\/smtpd|postfix\\/smtp|postfix\\/qmgr|klnagent|systemd|(?i)crond(?-i)|clamd|kesl|sshd|run-parts|automount|suexec|freshclam|kernel|vsftpd|ftpd|su)\\b(?=\\[|:))",
+  "fields": [
+{
+  "recordType": "syslog",
+  "regex": 
".*(?(?<=PID\\s=\\s).*?(?=\\sLine)).*(?(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?.*?(?=\")).*(?(?<=\").*?(?=$))"
+},
+{
+  "recordType": "pure-ftpd",
+  "regex": 
".*(?(?<=\\:\\s\\().*?(?=\\)\\s)).*?(?(?<=\\s\\[).*?(?=\\]\\s)).*?(?(?<=\\]\\s).*?(?=$))"
+},
+{
+  "recordType": "systemd",
+  "regex": [
+
".*(?(?<=\\ssystemd\\:\\s).*?(?=\\d+)).*?(?(?<=\\sSession\\s).*?(?=\\sof)).*?(?(?<=\\suser\\s).*?(?=\\.)).*$",
+
".*(?(?<=\\ssystemd\\:\\s).*?(?=\\sof)).*?(?(?<=\\sof\\s).*?(?=\\.)).*$"
+  ]
+},
+{
+  "recordType": "kesl",
+  "regex": ".*(?(?<=\\:).*?(?=$))"
+},
+{
+  "recordType": "dovecot",
+  "regex": [
+
".*(?(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?(?<=\\:).*?(?=\\:\\suser)).*?(?(?<=user\\=\\<).*?(?=\\>)).*?(?(?<=rip\\=).*?(?=,)).*?(?(?<=lip\\=).*?(?=,)).*?(?(?<=,\\s).*?(?=,)).*?(?(?<=session\\=\\<).*?(?=\\>)).*$",
+
".*(?(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?(?<=\\:).*?(?=\\:\\srip)).*?(?(?<=rip\\=).*?(?=,)).*?(?(?<=lip\\=).*?(?=,)).*?(?(?<=,\\s).*?(?=$))",
+
".*(?(?<=\\sdovecot:\\s).*?(?=\\:)).*?(?(?<=\\:).*?(?=$))"
+  ]
+},
+{
+  "recordType": "postfix/smtpd",
+  "regex": [
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\:).*?(?=$))",
+
".*(?(?<=\\[).*?(?=\\]:)).*?(?(?<=\\:\\s)disconnect(?=\\sfrom)).*?(?(?<=from).*(?=\\[)).*?(?(?<=\\[).*(?=\\])).*$"
+  ]
+},
+{
+  "recordType": "postfix/smtp",
+  "regex": [
+
".*(?(?<=smtp\\[).*?(?=\\]:)).*(?(?<=to=#\\<).*?(?=>,)).*(?(?<=relay=).*?(?=,)).*(?(?<=delay=).*?(?=,)).*(?(?<=delays=).*?(?=,)).*(?(?<=dsn=).*?(?=,)).*(?(?<=status=).*?(?=\\()).*?(?(?<=connect to).*?(?=\\[)).*?(?(?<=\\[).*?(?=\\])).*?(?(?<=\\]:).*?(?=:\\s)).*?(?(?<=:\\s).*?(?=$))",
+
".*(?(?<=smtp\\[).*?(?=\\]:)).*?(?(?<=connect to).*?(?=\\[)).*?(?(?<=\\[).*?(?=\\])).*(?(?<=:).*?(?=\\s)).*(?(?<=\\s).*?(?=$))",
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\:).*?(?=$))"
+  ]
+},
+{
+  "recordType": "crond",
+  "regex": [
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?(?<=CMD\\s\\().*?(?=\\))).*$",
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?(?<=\\().*?(?=\\))).*$",
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\]:\\s\\().*?(?=\\)\\s)).*?(?(?<=CMD\\s\\().*?(?=\\))).*$",
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\:).*?(?=$))"
+  ]
+},
+{
+  "recordType": "clamd",
+  "regex": [
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\:\\s).*?(?=\\:)).*?(?(?<=\\:).*?(?=$))",
+
".*(?(?<=\\:\\s).*?(?=\\:)).*?(?(?<=\\:).*?(?=$))"
+  ]
+},
+{
+  "recordType": "run-parts",
+  "regex": ".*(?(?<=\\sparts).*?(?=$))"
+},
+{
+  "recordType": "sshd",
+  "regex": [
+
".*(?(?<=\\[).*?(?=\\])).*?(?(?<=\\]:\\s).*?(?=\\sfor)).*?(?(?<=\\sfor\\s).*?(?=\\sfrom)).*?(?(?<=\\sfrom\\s).*?(?=\\sport)).*?(?(?<=\\sport\\

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-20 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r23519
  
--- Diff: 
metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java
 ---
@@ -127,5 +127,40 @@ public String getType() {
 }
   }
 
+   public enum ParserConfigConstants {
--- End diff --

As suggested, I have moved ParserConfigConstants into the 
RegularExpressionsParser class as an inner enum.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234873094
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+
+package org.apache.metron.parsers.regex;
+
+import java.nio.charset.Charset;
+import java.text.ParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.stream.Collectors;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.metron.common.Constants;
+import org.apache.metron.parsers.BasicParser;
+import org.apache.metron.common.Constants.ParserConfigConstants;
+import org.json.simple.JSONObject;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+//@formatter:off
+/**
+ * General purpose class to parse unstructured text message into a json 
object. This class parses
+ * the message as per supplied parser config as part of sensor config. 
Sensor parser config example:
+ *
+ * 
+ * 
+ * "convertCamelCaseToUnderScore": true,
+ * "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
+ * "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
+ * "fields": [
+ * {
+ * "recordType": "kernel",
+ * "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
+ * },
+ * {
+ * "recordType": "syslog",
+ * "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
+ * }
+ * ]
+ * 
+ * 
+ *
+ * Note: messageHeaderRegex could be specified as lists also e.g.
+ *
+ * 
+ * 
+ * "messageHeaderRegex": [
+ * "regular expression 1",
+ * "regular expression 2"
+ * ]
+ * 
+ * 
+ *
+ * Where regular expression 1 are valid regular 
expressions and may have named
+ * groups, which would be extracted into fields. This list will be 
evaluated in order until a
+ * matching regular expression is found.
+ * 
+ *
+ * Configuration fields explanation
+ *
+ * 
+ * recordTypeRegex : used to specify a regular expression to distinctly 
identify a record type.
+ * messageHeaderRegex :  used to specify a regular expression to extract 
fields from a message part which is common across all the messages.
+ * e.g. rhel logs looks like
+ * 
+ * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, 
type tmpfs), uses transition SIDs
+ * 
+ * 
+ * 
+ *
+ * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common 
across all messages.
+ * Hence messageHeaderRegex could be used to extract fields from this part.
+ *
+ * fields : json list of objects containing recordType and regex. regex 
could be a further list e.g.
+ *
+ * 
+ * 
+ * "regex":  [ "record type specific regular expression 1",
+ * "record type specific regular expression 2"]
+ *
+ * 
+ * 
+ *
+ * Limitation 
+ * Currently the named groups in java regular expressions have a 
limitation. Only following
+ * characters could be used to name a named group. A capturing group can 
also be assigned a "name",
+ * a named-capturing group, and then be back-reference

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234872968
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,118 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import static org.junit.Assert.assertTrue;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.apache.metron.parsers.regex.RegularExpressionsParser;
+
+public class RegularExpressionsParserTest {
+private RegularExpressionsParser regularExpressionsParser;
--- End diff --

I am not sure what was wrong here. I have configured the Google Java style 
code formatter in IntelliJ IDEA. If the concern was the line break after the 
class declaration, I have taken care of that.


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234872742
  
--- Diff: 
metron-platform/metron-common/src/main/java/org/apache/metron/common/Constants.java
 ---
@@ -127,5 +127,48 @@ public String getType() {
 }
   }
 
+   public static enum ParserConfigConstants {
+//@formatter:off
+RECORD_TYPE("recordType"),
+RECORD_TYPE_REGEX("recordTypeRegex"),
+REGEX("regex"),
+FIELDS("fields"),
+MESSAGE_HEADER("messageHeaderRegex"),
+ORIGINAL("original_string"),
+TIMESTAMP("timestamp"),
+CONVERT_CAMELCASE_TO_UNDERSCORE("convertCamelCaseToUnderScore");
+//@formatter:on
+private final String name;
+private static Map<String, ParserConfigConstants> nameToField;
+
+static {
+  nameToField = new HashMap<>();
+  for (final ParserConfigConstants f : ParserConfigConstants.values()) {
+nameToField.put(f.getName(), f);
+  }
+}
+
+
+ParserConfigConstants(String name) {
+  this.name = name;
+}
+
+public String getName() {
+  return name;
+}
+
+static {
--- End diff --

Removed the duplicate.
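
For reference, a minimal sketch of the enum with a single static initializer, nested inside a parser class as suggested elsewhere in this review. The outer class name `RegularExpressionsParserSketch` and the lookup method `fromString` are hypothetical and added only for illustration; the constants are the ones quoted in the diff above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical outer class standing in for the real parser.
class RegularExpressionsParserSketch {

  enum ParserConfigConstants {
    RECORD_TYPE("recordType"),
    RECORD_TYPE_REGEX("recordTypeRegex"),
    REGEX("regex"),
    FIELDS("fields"),
    MESSAGE_HEADER("messageHeaderRegex"),
    ORIGINAL("original_string"),
    TIMESTAMP("timestamp"),
    CONVERT_CAMELCASE_TO_UNDERSCORE("convertCamelCaseToUnderScore");

    private final String name;

    // Built once, after all constants exist; one static block is enough.
    private static final Map<String, ParserConfigConstants> NAME_TO_FIELD;

    static {
      Map<String, ParserConfigConstants> byName = new HashMap<>();
      for (ParserConfigConstants f : values()) {
        byName.put(f.getName(), f);
      }
      NAME_TO_FIELD = Collections.unmodifiableMap(byName);
    }

    ParserConfigConstants(String name) {
      this.name = name;
    }

    public String getName() {
      return name;
    }

    // Hypothetical lookup; returns null for an unknown config key.
    static ParserConfigConstants fromString(String name) {
      return NAME_TO_FIELD.get(name);
    }
  }
}
```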


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234872602
  
--- Diff: 
metron-platform/metron-parsers/src/test/java/org/apache/metron/parsers/regex/RegularExpressionsParserTest.java
 ---
@@ -0,0 +1,118 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.metron.parsers.regex;
+
+import static org.junit.Assert.assertTrue;
+
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import org.json.simple.JSONObject;
+import org.json.simple.parser.JSONParser;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+import org.apache.metron.parsers.regex.RegularExpressionsParser;
+
+public class RegularExpressionsParserTest {
+private RegularExpressionsParser regularExpressionsParser;
+private JSONObject parserConfig;
+
+@Test
--- End diff --

I have added more unit tests. An empty header regex is a perfectly valid 
scenario, and I have added a unit test to cover it. A missing 
recordTypeRegex or an invalid regex is not valid; such a configuration is 
detected during the topology initialization phase. I have added relevant unit 
tests for these scenarios as well.
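
The validation rules described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in the PR: the class `ParserConfigValidationSketch`, its `configure` signature, and the choice of `IllegalStateException` are all hypothetical.

```java
import java.util.Map;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of fail-fast config validation at initialization time.
class ParserConfigValidationSketch {

  private Pattern recordTypePattern;
  private Pattern headerPattern; // stays null when no header regex is configured

  void configure(Map<String, Object> config) {
    // recordTypeRegex is mandatory: reject a missing or blank value.
    Object recordTypeRegex = config.get("recordTypeRegex");
    if (!(recordTypeRegex instanceof String)
        || ((String) recordTypeRegex).trim().isEmpty()) {
      throw new IllegalStateException("recordTypeRegex must be configured");
    }
    recordTypePattern = compile("recordTypeRegex", (String) recordTypeRegex);

    // messageHeaderRegex is optional: an absent or empty value is accepted.
    Object headerRegex = config.get("messageHeaderRegex");
    if (headerRegex instanceof String && !((String) headerRegex).isEmpty()) {
      headerPattern = compile("messageHeaderRegex", (String) headerRegex);
    }
  }

  // Surfaces a syntactically invalid expression as a configuration error.
  private static Pattern compile(String key, String regex) {
    try {
      return Pattern.compile(regex);
    } catch (PatternSyntaxException e) {
      throw new IllegalStateException(
          "Invalid regular expression for " + key + ": " + e.getDescription(), e);
    }
  }

  boolean hasHeaderPattern() {
    return headerPattern != null;
  }

  Pattern getRecordTypePattern() {
    return recordTypePattern;
  }
}
```

Compiling every expression once during configuration means a bad pattern stops the topology at startup instead of failing on the first message.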


---


[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234872641
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234871911
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234871662
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234871255
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234871183
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more 
contributor license
+ * agreements. See the NOTICE file distributed with this work for 
additional information regarding
+ * copyright ownership. The ASF licenses this file to you under the Apache 
License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance with the 
License. You may obtain a
+ * copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software 
distributed under the License
+ * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF 
ANY KIND, either express
+ * or implied. See the License for the specific language governing 
permissions and limitations under
+ * the License.
+ */
+
+package org.apache.metron.parsers.regex;
+
+import java.nio.charset.Charset;
+import java.text.ParseException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.stream.Collectors;
+import org.apache.commons.lang3.StringUtils;
+import org.apache.metron.common.Constants;
+import org.apache.metron.parsers.BasicParser;
+import org.apache.metron.common.Constants.ParserConfigConstants;
+import org.json.simple.JSONObject;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+//@formatter:off
+/**
+ * General purpose class to parse unstructured text message into a json 
object. This class parses
+ * the message as per supplied parser config as part of sensor config. 
Sensor parser config example:
+ *
+ * 
+ * 
+ * "convertCamelCaseToUnderScore": true,
+ * "recordTypeRegex": "(?<process>(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:))",
+ * "messageHeaderRegex": "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?(?<syslogHost>(?<=\\s).*?(?=\\s))",
+ * "fields": [
+ * {
+ * "recordType": "kernel",
+ * "regex": ".*(?<eventInfo>(?<=\\]|\\w\\:).*?(?=$))"
+ * },
+ * {
+ * "recordType": "syslog",
+ * "regex": ".*(?<processid>(?<=PID\\s=\\s).*?(?=\\sLine)).*(?<filePath>(?<=64\\s)\/([A-Za-z0-9_-]+\/)+(?=\\w))(?<fileName>.*?(?=\")).*(?<eventInfo>(?<=\").*?(?=$))"
+ * }
+ * ]
+ * 
+ * 
+ *
+ * Note: messageHeaderRegex could be specified as lists also e.g.
+ *
+ * 
+ * 
+ * "messageHeaderRegex": [
+ * "regular expression 1",
+ * "regular expression 2"
+ * ]
+ * 
+ * 
+ *
+ * Where regular expression 1 and regular expression 2 are valid regular expressions and may have named
+ * groups, which would be extracted into fields. This list will be evaluated in order until a
+ * matching regular expression is found.
+ * 
+ *
+ * Configuration fields explanation
+ *
+ * 
+ * recordTypeRegex : used to specify a regular expression to distinctly 
identify a record type.
+ * messageHeaderRegex :  used to specify a regular expression to extract 
fields from a message part which is common across all the messages.
+ * e.g. rhel logs looks like
+ * 
+ * <7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, 
type tmpfs), uses transition SIDs
+ * 
+ * 
+ * 
+ *
+ * Here message structure (<7>Jun 26 16:18:01 hostName kernel) is common 
across all messages.
+ * Hence messageHeaderRegex could be used to extract fields from this part.
+ *
+ * fields : json list of objects containing recordType and regex. regex 
could be a further list e.g.
+ *
+ * 
+ * 
+ * "regex":  [ "record type specific regular expression 1",
+ * "record type specific regular expression 2"]
+ *
+ * 
+ * 
+ *
+ * Limitation 
+ * Currently the named groups in java regular expressions have a 
limitation. Only following
+ * characters could be used to name a named group. A capturing group can 
also be assigned a "name",
+ * a named-capturing group, and then be back-reference
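
The Javadoc quoted above describes two behaviours that are easier to see in code: fields are pulled out of a message through named capturing groups, and `convertCamelCaseToUnderScore` exists because Java group names may contain only ASCII letters and digits (starting with a letter), so an underscore in a group name is rejected at compile time. The following is a minimal sketch of that idea, not the parser's actual implementation. The class `NamedGroupSketch` and its methods are hypothetical names introduced only for illustration; the header regex is the one from the example config, assuming the named groups read `(?<name>...)` and the lookbehinds read `(?<=...)`.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical illustration class; not part of the Metron code base.
final class NamedGroupSketch {

  // Header regex from the example config, written as a Java string literal.
  static final String HEADER_REGEX =
      "(?<syslogpriority>(?<=^<)\\d{1,4}(?=>)).*?"
          + "(?<timestamp>(?<=>)[A-Za-z]{3}\\s{1,2}\\d{1,2}\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(?=\\s)).*?"
          + "(?<syslogHost>(?<=\\s).*?(?=\\s))";

  // Finds "(?<name>" openers. Lookbehinds such as "(?<=" are skipped because '=' is not a letter.
  // Simplification: escaped parentheses in the regex are not handled.
  private static final Pattern GROUP_NAME = Pattern.compile("\\(\\?<([A-Za-z][A-Za-z0-9]*)>");

  // ipSrcAddr -> ip_src_addr
  static String camelToUnderscore(String name) {
    return name.replaceAll("([a-z0-9])([A-Z])", "$1_$2").toLowerCase(Locale.ROOT);
  }

  // Returns every named group that matched, keyed by (optionally converted) group name.
  static Map<String, String> extract(String regex, String message, boolean convert) {
    Map<String, String> fields = new LinkedHashMap<>();
    Matcher m = Pattern.compile(regex).matcher(message);
    if (!m.find()) {
      return fields;
    }
    Matcher names = GROUP_NAME.matcher(regex);
    while (names.find()) {
      String name = names.group(1);
      String value = m.group(name);
      if (value != null) {
        fields.put(convert ? camelToUnderscore(name) : name, value);
      }
    }
    return fields;
  }

  // True if java.util.regex accepts the given string as a capturing-group name.
  static boolean isValidGroupName(String name) {
    try {
      Pattern.compile("(?<" + name + ">x)");
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String msg = "<7>Jun 26 16:18:01 hostName kernel: SELinux: initialized";
    System.out.println(extract(HEADER_REGEX, msg, true));
    System.out.println(isValidGroupName("syslog_host"));
  }
}
```

With conversion enabled, the group `syslogHost` is emitted as the field `syslog_host`, which is how a config can produce underscore-separated field names even though the regex itself cannot use them.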

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-11-19 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on a diff in the pull request:

https://github.com/apache/metron/pull/1245#discussion_r234870826
  
--- Diff: 
metron-platform/metron-parsers/src/main/java/org/apache/metron/parsers/regex/RegularExpressionsParser.java
 ---
@@ -0,0 +1,427 @@

[GitHub] metron pull request #1245: METRON-1795: Initial Commit for Regular Expressio...

2018-10-23 Thread jagdeepsingh2
GitHub user jagdeepsingh2 opened a pull request:

https://github.com/apache/metron/pull/1245

METRON-1795: Initial Commit for Regular Expressions Parser

## Contributor Comments
Contributing a new general purpose regular expressions based parser.


## Pull Request Checklist

Thank you for submitting a contribution to Apache Metron.  
Please refer to our [Development 
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
 for the complete guide to follow for contributions.  
Please refer also to our [Build Verification 
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
 for complete smoke testing guides.  


In order to streamline the review of the contribution we ask you follow 
these guidelines and ask you to double check the following:

### For all changes:
- [ ] Is there a JIRA ticket associated with this PR? If not one needs to 
be created at [Metron 
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
**Yes. Jira created for this PR. 
https://issues.apache.org/jira/browse/METRON-1795**
- [ ] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
**Yes.**
- [ ] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
**Yes**


### For code changes:
- [ ] Have you included steps to reproduce the behavior or problem that is 
being changed or addressed?
**N/A as this  PR is for a new feature.** 
- [ ] Have you included steps or a guide to how the change may be verified 
and tested manually?
**Yes. Included Junit can be used to test the new parser.**
- [ ] Have you ensured that the full suite of tests and checks have been 
executed in the root metron folder via:
  ```
  mvn -q clean integration-test install && dev-utilities/build-utils/verify_licenses.sh
  ```
**Yes.**
- [ ] Have you written or updated unit tests and or integration tests to 
verify your changes?
**I have included the unit tests.**
- [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
**N/A**
- [ ] Have you verified the basic functionality of the build by building 
and running locally with Vagrant full-dev environment or the equivalent?
**Yes**
### For documentation related changes:
- [ ] Have you ensured that format looks appropriate for the output in 
which it is rendered by building and verifying the site-book? If not then run 
the following commands and the verify changes via 
`site-book/target/site/index.html`:

  ```
  cd site-book
  mvn site
  ```
**Yes.**

#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up 
for your personal repository such that your branches are built there before 
submitting a pull request.

Note: This is a follow-up to an earlier PR for METRON-1795, which was 
created and subsequently closed because of a corrupted git commit history. The 
following comments were posted on the earlier PR; I am reposting them here with 
my disposition.

@nickwallen commented 27 days ago
Thanks for the contribution @jagdeepsingh2. To take this any further we 
need at a minimum the following items.

**An explanation of what itch this scratches (Why is this needed over Grok 
parser?)**
This question was answered in the associated jira ticket 
(https://issues.apache.org/jira/browse/METRON-1795). In a nutshell:
- Allow for more advanced parsing scenarios (specifically, dealing with 
multiple regex lines for devices that contain several log formats within them).
- Give users and developers of Metron additional options for parsing.
- With the new parser chaining and regex routing feature available in Metron, 
this gives some additional flexibility to logically separate a flow by:
  - Regex routing to segregate logs at a device level and handle envelope 
unwrapping.
  - This general purpose regex parser to parse an entire device type that 
contains multiple log formats within the single device (for example, RHEL logs).

Also, as per GrokParser documentation 
(https://cwiki.apache.org/confluence/display/METRON/Parsing+Topology) it is 
intended for low volume scenarios only, while we have tested this parser 
(RegularExpressionsParser) in very high volume scenarios also.
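
To make the "multiple log formats within a single device" case concrete, here is a rough sketch of the two-step flow: first identify the record type, then try that record type's regular expressions in order until one matches. It is an illustration under stated assumptions, not the code in this PR. `RecordTypeRoutingSketch` and its methods are hypothetical names, the record-type regex is the one from the example config with its outer named group dropped, and the per-record-type patterns are simplified stand-ins invented for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the Metron code base.
final class RecordTypeRoutingSketch {

  // Identifies the record type: "kernel" or "syslog" preceded by whitespace and followed by '[' or ':'.
  static final Pattern RECORD_TYPE = Pattern.compile("(?<=\\s)\\b(kernel|syslog)\\b(?=\\[|:)");

  // Record type -> ordered list of patterns. These patterns are invented for the example.
  static final Map<String, List<Pattern>> FIELDS = new LinkedHashMap<>();

  static {
    FIELDS.put("kernel", Arrays.asList(
        Pattern.compile("audit\\((?<eventInfo>[^)]*)\\)"),
        Pattern.compile("kernel:\\s(?<eventInfo>.*)$")));
    FIELDS.put("syslog", Arrays.asList(
        Pattern.compile("\\]:\\s(?<eventInfo>.*)$")));
  }

  static Optional<String> recordType(String message) {
    Matcher m = RECORD_TYPE.matcher(message);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  // Tries the patterns of the detected record type in order and stops at the first match.
  static Optional<String> eventInfo(String message) {
    Optional<String> type = recordType(message);
    if (!type.isPresent()) {
      return Optional.empty();
    }
    for (Pattern p : FIELDS.getOrDefault(type.get(), Collections.<Pattern>emptyList())) {
      Matcher m = p.matcher(message);
      if (m.find()) {
        return Optional.of(m.group("eventInfo"));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    String kernelMsg = "<7>Jun 26 16:18:01 hostName kernel: SELinux: initialized (dev tmpfs, "
        + "type tmpfs), uses transition SIDs";
    System.out.println(recordType(kernelMsg));
    System.out.println(eventInfo(kernelMsg));
  }
}
```

Compiling each pattern once and reusing it per message keeps the per-message cost to the matching itself, which matters in the high-volume scenarios mentioned above.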

**Documented Instructions on how to use your parser. Include a README.md in 
your code contribution.**
I have updated 

[GitHub] metron pull request #1214: METRON-1795 Initial commit for a general purpose ...

2018-10-23 Thread jagdeepsingh2
Github user jagdeepsingh2 closed the pull request at:

https://github.com/apache/metron/pull/1214


---


[GitHub] metron issue #1214: METRON-1795 Initial commit for a general purpose regular...

2018-10-23 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on the issue:

https://github.com/apache/metron/pull/1214
  
Closing this PR because of corrupted git commits history. I will create a 
new PR for this.


---


[GitHub] metron issue #1214: METRON-1795 Initial commit for a general purpose regular...

2018-10-04 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on the issue:

https://github.com/apache/metron/pull/1214
  
Yeah, I performed a rebase yesterday as I had to pull the latest changes 
from upstream. What is the best way out? Should I discard this PR and create a 
fresh and clean PR?


---


[GitHub] metron pull request #1222: Updated the Readme.md for Regular expressions par...

2018-10-03 Thread jagdeepsingh2
Github user jagdeepsingh2 closed the pull request at:

https://github.com/apache/metron/pull/1222


---


[GitHub] metron pull request #1222: Updated the Readme.md for Regular expressions par...

2018-10-03 Thread jagdeepsingh2
GitHub user jagdeepsingh2 opened a pull request:

https://github.com/apache/metron/pull/1222

Updated the Readme.md for Regular expressions parser.

Configuration field explanation for regular expressions parser.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jagdeepsingh2/metron patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/metron/pull/1222.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1222


commit f888fde0e9b0d46889749ce753aa0183531e20a2
Author: jagdeepsingh2 <43630622+jagdeepsingh2@...>
Date:   2018-10-04T02:45:52Z

Updated the Readme.md for Regular expressions parser.

Configuration field explanation for regular expressions parser.




---


[GitHub] metron issue #1222: Updated the Readme.md for Regular expressions parser.

2018-10-03 Thread jagdeepsingh2
Github user jagdeepsingh2 commented on the issue:

https://github.com/apache/metron/pull/1222
  
This PR needs to be ignored.


---


[GitHub] metron pull request #1214: METRON-1795 Initial commit for a general purpose ...

2018-09-27 Thread jagdeepsingh2
GitHub user jagdeepsingh2 opened a pull request:

https://github.com/apache/metron/pull/1214

METRON-1795 Initial commit for a general purpose regular expressions parser.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jagdeepsingh2/metron master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/metron/pull/1214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1214


commit ed56a3afd135c84e2cab52dda969c8e112f43602
Author: jagdeep 
Date:   2018-09-27T07:37:19Z

METRON-1795 Initial commit for a general purpose regular expressions parser.




---