Hi, 

There are a number of options here. 

Your first port of call would be to store the logs coming in from the
source in an HDFS directory as time-series entries. I assume the logs
will be in textual format and compressed (gzip, bzip2, etc.). They can
be stored individually, and you can run scheduled jobs to analyse the
data. I did something similar recently as a showcase: taking an Oracle
alert log stored as a compressed file in HDFS, looking for ORA- errors
and grouping them using standard SQL. In general, the same approach can
be applied to any log file. From the RDD, which I will call _rdd,_ I
will create a DataFrame called _df,_ and then register a temporary
relational table using the method registerTempTable(). That call creates
an in-memory table, which I will call _tmp,_ scoped to the SQLContext in
which it was registered (not the whole cluster). Note that registering
the table does not by itself cache any data; the table is only held in
Spark's optimised in-memory columnar format once you cache it
explicitly, as in the sketch below.
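A minimal sketch of that caching step (using the sqlContext that
spark-shell provides, and the table name _tmp_ from the walkthrough
further down):

// Materialise the temporary table in Spark's in-memory columnar format,
// so that repeated queries do not rescan the file in HDFS
sqlContext.cacheTable("tmp")
// ... run queries against tmp here ...
// Release the cached blocks when finished
sqlContext.uncacheTable("tmp")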

I will use spark-shell, one of the main tools that comes with Spark, for
this work. My Spark version is 1.5.2 and the shell language here is Scala.

// Create an RDD from the compressed Oracle alert log stored in the /test
// directory in HDFS. sc.textFile reads gzip-compressed files transparently.
val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")

// Convert the RDD to a single-column DataFrame; the column is named "string"
val df = rdd.toDF("string")

// Register the DataFrame as a temporary table called tmp
df.registerTempTable("tmp")

// Run standard SQL to find '%ORA-%' errors, most frequent first
sql("""SELECT SUBSTRING(string,1,11) AS Error, COUNT(1) AS Count
       FROM tmp
       WHERE string LIKE '%ORA-%'
       GROUP BY SUBSTRING(string,1,11)
       ORDER BY Count DESC
       LIMIT 10""").show()

+-----------+-----+
|      Error|Count|
+-----------+-----+
|ORA-19815: |35798|
|ORA-19809: |35793|
|ORA-19804: |35784|
|ORA-02063: |  435|
|ORA-28500: |  432|
|ORA-00443: |  157|
|ORA-00312: |   22|
|ORA-16038: |    9|
|ORA-1652: u|    6|
|ORA-27061: |    3|
+-----------+-----+
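The same pattern applies to your web-log question: bucket the timestamp
into a time bracket and group by host. A rough sketch, assuming a
hypothetical access log whose first field is the requesting host and
whose second field is an epoch-seconds timestamp; the file name, layout
and the 1000-attempts/10-minute thresholds are illustrative, so adapt
the parsing to your actual log format:

// Parse raw lines into (host, ts); Hit is a helper case class for toDF()
case class Hit(host: String, ts: Long)
val web = sc.textFile("hdfs://rhes564:9000/test/access_log.gz")
  .map(_.split("\\s+"))
  .filter(a => a.length >= 2 && a(1).forall(_.isDigit))
  .map(a => Hit(a(0), a(1).toLong))
  .toDF()
web.registerTempTable("weblogs")

// Hosts with more than 1000 connection attempts inside a 10-minute bracket
sql("""SELECT host, FLOOR(ts/600) AS bracket, COUNT(1) AS attempts
       FROM weblogs
       GROUP BY host, FLOOR(ts/600)
       HAVING COUNT(1) > 1000
       ORDER BY attempts DESC""").show()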

 HTH, 

> Hello,
> I have a few newbie questions regarding Spark.
> Is Spark a good tool to process Web logs for attacks (or is it better to use 
> a more specialized tool)? If so, are there any plugins for this purpose?
> Can you use Spark to weed out huge logs and extract only suspicious 
> activities; e.g., 1000 attempts to connect to a particular host within a time 
> bracket?
> Many thanks.
> Cheers,
> Philippe

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 
