Wow, great post, very detailed. My question is: what kind of "web logs"
do they have? If those logs are application logs, like Apache httpd
logs or Oracle logs, then sure, this is a typical use case for Spark
or, more generally, for the Hadoop tech stack.

But if Philippe is talking about network attacks, then there will
probably be network traffic logs such as pcap files, and AFAIK Spark
and the other tools in the Hadoop stack do not have well-known plugins
for the pcap file format yet.
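
One common workaround, sketched below under stated assumptions, is to
flatten the capture to text outside Spark and then treat it as an
ordinary log. The tshark invocation, the paths and the three-field
layout are illustrative, not a fixed recipe:

// Export the capture to CSV first, e.g. with tshark:
//   tshark -r capture.pcap -T fields \
//     -e frame.time_epoch -e ip.src -e ip.dst \
//     -E separator=, > capture.csv
// Then analyse the CSV like any other text file in Spark
val packets = sc.textFile("hdfs:///tmp/capture.csv")
  .map(_.split(","))
  .filter(_.length == 3)           // keep only well-formed rows
  .map(f => (f(1), f(2)))          // (source IP, destination IP)

// Count packets per source IP to spot the noisiest senders
packets.map { case (src, _) => (src, 1) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)
  .foreach(println)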


2016-02-22 22:36 GMT+01:00 Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk>:

> Hi,
>
> There are a number of options here.
>
> Your first port of call would be to store the logs that come in from
> the source in an HDFS directory as time-series entries. I assume the
> logs will be in textual format and compressed (gzip, bzip2 etc.). They
> can be stored individually and you can run scheduled jobs to analyse
> the data. I did something similar recently as a showcase: I took an
> Oracle log file stored as a compressed file in HDFS, looked for ORA-
> errors and grouped them using standard SQL. In general, this approach
> can be applied to any log file. From the RDD, which I will call *rdd*,
> I will create a DataFrame called *df* and then register it as a
> temporary relational table using the registerTempTable() method. That
> call creates a table that I will call *tmp*, scoped to the SQLContext
> in which it was created. The table is evaluated lazily; if you cache
> it, the data is stored in Spark's in-memory columnar format.
>
> I will use spark-shell, one of the main tools that comes with Spark,
> for this work. My Spark version is 1.5.2 and the shell language is
> Scala.
>
> // Look at the Oracle alert log stored in the /test directory in HDFS
> // and create an RDD for it
> val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")
>
> // Convert the RDD to a DataFrame called df with a single string
> // column named "string"
> val df = rdd.toDF("string")
>
> // Register the DataFrame as a temporary table called tmp
> df.registerTempTable("tmp")
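>
> // Optionally cache the table; repeated queries will then read
> // Spark's in-memory columnar store instead of re-scanning the
> // gzipped file in HDFS
> sqlContext.cacheTable("tmp")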
>
> // Run standard SQL to look for '%ORA-%' errors
> sqlContext.sql("""SELECT SUBSTRING(string,1,11) AS Error,
>   count(1) AS Count FROM tmp WHERE string LIKE '%ORA-%'
>   GROUP BY SUBSTRING(string,1,11)
>   ORDER BY Count DESC LIMIT 10""").show()
>
> +-----------+-----+
> |      Error|Count|
> +-----------+-----+
> |ORA-19815: |35798|
> |ORA-19809: |35793|
> |ORA-19804: |35784|
> |ORA-02063: |  435|
> |ORA-28500: |  432|
> |ORA-00443: |  157|
> |ORA-00312: |   22|
> |ORA-16038: |    9|
> |ORA-1652: u|    6|
> |ORA-27061: |    3|
> +-----------+-----+
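>
> The same pattern extends to the kind of check Philippe asks about in
> his original message below (e.g. flagging 1000 connection attempts on
> a host within a time bracket). A minimal sketch, assuming lines begin
> with "yyyy-MM-dd HH:mm:ss <host>"; the regex, field layout and file
> name are hypothetical and would need adjusting to the real log format:
>
> // Parse each line into an (hour bucket, host) pair; lines that do
> // not match the assumed layout are dropped
> case class Attempt(bucket: String, host: String)
> val p = """^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}\s+(\S+)""".r
> val attempts = sc.textFile("hdfs://rhes564:9000/test/access.log.gz")
>   .flatMap(line => p.findFirstMatchIn(line)
>     .map(m => Attempt(m.group(1), m.group(2)))).toDF()
> attempts.registerTempTable("attempts")
>
> // Hosts with more than 1000 attempts in any one-hour bucket
> sqlContext.sql("""SELECT bucket, host, count(1) AS attempts
>   FROM attempts GROUP BY bucket, host
>   HAVING count(1) > 1000
>   ORDER BY attempts DESC""").show()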
>
>
>
>   HTH,
>
> ---- Philippe's original message ----
>
> Hello,
> I have a few newbie questions regarding Spark.
> Is Spark a good tool to process Web logs for attacks (or is it better
> to use a more specialized tool)? If so, are there any plugins for this
> purpose?
> Can you use Spark to weed out huge logs and extract only suspicious
> activities, e.g., 1000 attempts to connect to a particular host within
> a time bracket?
> Many thanks.
> Cheers,
> Philippe
>
>
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
