Hi,
There are a number of options here. Your first port of call would be to store these logs, as they arrive from the source, in an HDFS directory as time-series entries. I assume the logs will be in textual format and compressed (gzip, bzip2 etc.). They can be stored individually, and you can run scheduled jobs to analyse the data.

I did something similar recently as a showcase: taking an Oracle alert log stored as a compressed file in HDFS, looking for ORA- errors and grouping them using standard SQL. In general, this can be applied to any log file. From the RDD, which I call _rdd_, I create a DataFrame called _df_, and then create a temporary relational table using the method registerTempTable(). That call creates an in-memory table, which I call _tmp_, scoped to the SQLContext in which it was created; when cached, the data is stored in Spark's optimised in-memory columnar format. I use spark-shell, one of the main tools that comes with Spark, for this work. My Spark version is 1.5.2 and the shell language is Scala.

//Looking at an Oracle alert log stored in the /test directory in HDFS. Create the RDD for it
val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")

//Convert the RDD to a DataFrame called df with a single string column
val df = rdd.toDF("string")

//Register this DataFrame as a temporary table called tmp
df.registerTempTable("tmp")

//Run standard SQL to look for '%ORA-%' errors
sql("SELECT SUBSTRING(string,1,11) AS Error, count(1) AS Count FROM tmp WHERE string LIKE '%ORA-%' GROUP BY SUBSTRING(string,1,11) ORDER BY Count DESC LIMIT 10").show()

+-----------+-----+
|      Error|Count|
+-----------+-----+
|ORA-19815: |35798|
|ORA-19809: |35793|
|ORA-19804: |35784|
|ORA-02063: |  435|
|ORA-28500: |  432|
|ORA-00443: |  157|
|ORA-00312: |   22|
|ORA-16038: |    9|
|ORA-1652: u|    6|
|ORA-27061: |    3|
+-----------+-----+

HTH,

> Hello,
> I have a few newbie questions regarding Spark.
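On the specific case of flagging hosts with an unusual number of connection attempts, the same group-and-count idea applies. Here is a minimal sketch of the core logic in plain Scala, outside Spark; the log format, the field positions and the threshold are made up for illustration, not taken from any real web-server log:

```scala
// Hypothetical web-access log lines: "timestamp host action"
val logLines = Seq(
  "2016-03-01T10:00:01 10.0.0.5 CONNECT",
  "2016-03-01T10:00:02 10.0.0.5 CONNECT",
  "2016-03-01T10:00:03 10.0.0.9 CONNECT",
  "2016-03-01T10:00:04 10.0.0.5 CONNECT"
)

// Count connection attempts per host, then keep only hosts at or above
// a threshold (1000 in the original question; 2 here so the toy data triggers it)
val threshold = 2
val suspicious = logLines
  .map(_.split(" ")(1))                            // extract the host field
  .groupBy(identity)                               // host -> all its lines
  .map { case (host, hits) => (host, hits.size) }  // host -> attempt count
  .filter { case (_, n) => n >= threshold }

println(suspicious)  // Map(10.0.0.5 -> 3)
```

In Spark the equivalent would be the temporary-table approach above with GROUP BY on the host column and a HAVING count(1) >= 1000 clause (optionally also grouping on a time bucket derived from the timestamp), or map/reduceByKey/filter directly on the RDD.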
> Is Spark a good tool to process Web logs for attacks (or is it better to use
> a more specialized tool)? If so, are there any plugins for this purpose?
> Can you use Spark to weed out huge logs and extract only suspicious
> activities; e.g., 1000 attempts to connect to a particular host within a time
> bracket?
> Many thanks.
> Cheers,
> Philippe
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only; if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Cloud Technology Partners Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free; therefore neither Cloud Technology Partners Ltd, its subsidiaries nor their employees accept any responsibility.