Thank you both, Jorge and Mich.
You've answered my questions almost in real time!
I will look into Flume and HDFS.

> On 22 Feb 2016, at 22:41, Jorge Machado <[email protected]> wrote:
> 
> To get the data there, you could use Flume to ship the logs from the servers 
> to HDFS, for example, and then run Spark Streaming on it.
> 
> Check these:
> http://spark.apache.org/docs/latest/streaming-flume-integration.html
> http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
> http://www.rittmanmead.com/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/
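> 
> A minimal sketch of the receiver-based Flume integration from the first link 
> might look like this (the app name, host, port, and batch interval are 
> placeholder assumptions, and the spark-streaming-flume artifact must be on 
> the classpath):
> 
> ```scala
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.flume.FlumeUtils
> 
> // Placeholder app name and a 10-second batch interval (assumptions)
> val conf = new SparkConf().setAppName("FlumeLogIngest")
> val ssc = new StreamingContext(conf, Seconds(10))
> 
> // Flume pushes events to this host:port via an Avro sink (placeholders)
> val stream = FlumeUtils.createStream(ssc, "localhost", 41414)
> 
> // Decode each event body and count events per batch as a sanity check
> stream.map(e => new String(e.event.getBody.array())).count().print()
> 
> ssc.start()
> ssc.awaitTermination()
> ```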
> 
> 
> 
>> On 22/02/2016, at 22:36, Mich Talebzadeh 
>> <[email protected]> wrote:
>> 
>> Hi,
>> 
>> There are a number of options here.
>> 
>> Your first port of call would be to store these logs as they come in from 
>> the source in an HDFS directory as time-series entries. I assume the logs 
>> will be in textual format and compressed (gzip, bzip2, etc.). They can be 
>> stored individually, and you can run scheduled jobs to analyse the data. I 
>> did something similar recently as a showcase: I took an Oracle alert log 
>> stored as a compressed file in HDFS, looked for ORA- errors, and grouped 
>> them using standard SQL. In general, this can be applied to any log file. 
>> From the RDD, which I will call rdd, I create a DataFrame called df and then 
>> register a temporary relational table using a method called 
>> registerTempTable(). That method call creates an in-memory table, which I 
>> will call tmp, scoped to the SQLContext in which it was created; if you 
>> cache it, the data is stored in Spark's optimized in-memory columnar format.
>> 
>> I will use spark-shell, one of the main tools that comes with Spark, for 
>> this work. My Spark version is 1.5.2, and the shell language here is Scala.
>> 
>> // Look at the Oracle log file stored in the /test directory in HDFS and
>> // create an RDD for it
>> val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")
>> 
>> // Convert the RDD to a DataFrame called df with a single string column
>> val df = rdd.toDF("string")
>> 
>> // Register this DataFrame as a temporary table called tmp
>> df.registerTempTable("tmp")
>> 
>> // Run standard SQL to look for '%ORA-%' errors
>> sql("SELECT SUBSTRING(string,1,11) AS Error, count(1) AS Count FROM tmp 
>> WHERE string LIKE '%ORA-%' GROUP BY SUBSTRING(string,1,11) ORDER BY Count 
>> DESC LIMIT 10").show()
>> 
>> +-----------+-----+
>> |      Error|Count|
>> +-----------+-----+
>> |ORA-19815: |35798|
>> |ORA-19809: |35793|
>> |ORA-19804: |35784|
>> |ORA-02063: |  435|
>> |ORA-28500: |  432|
>> |ORA-00443: |  157|
>> |ORA-00312: |   22|
>> |ORA-16038: |    9|
>> |ORA-1652: u|    6|
>> |ORA-27061: |    3|
>> +-----------+-----+
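>> 
>> For what it is worth, the same query can be expressed with the DataFrame 
>> API instead of SQL. This is only a sketch against the df above; substring 
>> and count come from org.apache.spark.sql.functions (available from Spark 
>> 1.5), and $ relies on spark-shell's implicit imports:
>> 
>> ```scala
>> import org.apache.spark.sql.functions.{substring, count}
>> 
>> // Same filter, group, and ordering as the SQL version above
>> df.filter($"string".like("%ORA-%"))
>>   .groupBy(substring($"string", 1, 11).as("Error"))
>>   .agg(count("*").as("Count"))
>>   .orderBy($"Count".desc)
>>   .limit(10)
>>   .show()
>> ```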
>> 
>> HTH,
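>> 
>> On your question below about flagging, say, 1000 connection attempts to one 
>> host within a time bracket: once the web log is parsed and registered as a 
>> table, a plain SQL sketch could look like this. The table name weblog and 
>> the columns host and ts (epoch seconds) are hypothetical; they depend on 
>> how you parse the log:
>> 
>> ```scala
>> // Hypothetical table weblog with columns host and ts (epoch seconds);
>> // 600-second brackets, threshold of 1000 attempts
>> sql("""SELECT host, FLOOR(ts / 600) AS bracket, count(1) AS attempts
>>        FROM weblog
>>        GROUP BY host, FLOOR(ts / 600)
>>        HAVING count(1) >= 1000""").show()
>> ```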
>> 
>>> Hello,
>>> I have a few newbie questions regarding Spark.
>>> Is Spark a good tool to process web logs for attacks (or is it better to 
>>> use a more specialized tool)? If so, are there any plugins for this 
>>> purpose?
>>> Can you use Spark to weed out huge logs and extract only suspicious 
>>> activities; e.g., 1000 attempts to connect to a particular host within a 
>>> time bracket?
>>> Many thanks.
>>> Cheers,
>>> Philippe
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>> 
>>  
>>  
>> -- 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> 
> 
