To get the logs there, you could use Flume to ship them from the web servers into HDFS, for example, and then run Spark Streaming on them.
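A rough sketch of that pipeline, using the push-based Flume receiver from the integration guide linked below, might look like the following in spark-shell. This is only an illustration: the host/port, the assumption that the client host is the first field of each log line, and the 1000-attempts threshold are placeholders I have made up, and it assumes the spark-streaming-flume artifact is on the classpath (e.g. --packages org.apache.spark:spark-streaming-flume_2.10:1.5.2).

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// 10-second micro-batches; sc is the SparkContext spark-shell provides
val ssc = new StreamingContext(sc, Seconds(10))

// Flume's Avro sink pushes events to this host/port (placeholder values)
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141)

// Each Flume event body is one raw log line
val lines = flumeStream.map(e => new String(e.event.getBody.array()))

// Assume the client host is the first whitespace-separated field of the line
val hostHits = lines.map(line => (line.split("\\s+")(0), 1L))

// Count hits per host over a sliding 10-minute window, evaluated every minute,
// and keep only hosts above a threshold (e.g. 1000 connection attempts)
val suspicious = hostHits
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(600), Seconds(60))
  .filter { case (_, count) => count >= 1000L }

suspicious.print()

ssc.start()
ssc.awaitTermination()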
Check these:

http://spark.apache.org/docs/latest/streaming-flume-integration.html
http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
http://www.rittmanmead.com/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/

> On 22/02/2016, at 22:36, Mich Talebzadeh <[email protected]> wrote:
>
> Hi,
>
> There are a number of options here.
>
> Your first port of call would be to store the logs that come in from the
> source in an HDFS directory as time-series entries. I assume the logs will be
> in textual format and compressed (gzip, bzip2 etc.). They can be stored
> individually and you can run timely jobs to analyse the data. I did something
> similar recently as a showcase: taking an Oracle alert log stored as a
> compressed file in HDFS, looking for ORA- errors and grouping them using
> standard SQL. In general, this can be applied to any log file.
>
> From the RDD, which I will call rdd, I will create a DataFrame called df and
> then create a temporary relational table using another method called
> registerTempTable(). That call creates an in-memory table, which I will call
> tmp, scoped to the cluster in which it was created. The data is stored using
> Hive's highly optimised, in-memory columnar format.
>
> I will use spark-shell, one of the main tools that comes with Spark, for this
> work. My Spark version is 1.5.2 and the shell language is Scala.
>
> // Looking at an Oracle log file stored in the /test directory in HDFS.
> // Create the RDD for it
>
> val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")
>
> // Convert the RDD to a DataFrame called df with a single string column
>
> val df = rdd.toDF("string")
>
> // Register this DataFrame as a temporary table called tmp
>
> df.registerTempTable("tmp")
>
> // Run standard SQL to look for '%ORA-%' errors
>
> sql("SELECT SUBSTRING(string,1,11) AS Error, count(1) AS Count FROM tmp WHERE
> string LIKE '%ORA-%' GROUP BY SUBSTRING(string,1,11) ORDER BY Count DESC
> LIMIT 10").show()
>
> +-----------+-----+
> |      Error|Count|
> +-----------+-----+
> |ORA-19815: |35798|
> |ORA-19809: |35793|
> |ORA-19804: |35784|
> |ORA-02063: |  435|
> |ORA-28500: |  432|
> |ORA-00443: |  157|
> |ORA-00312: |   22|
> |ORA-16038: |    9|
> |ORA-1652: u|    6|
> |ORA-27061: |    3|
> +-----------+-----+
>
> HTH,
>
>> Hello,
>>
>> I have a few newbie questions regarding Spark.
>>
>> Is Spark a good tool to process web logs for attacks (or is it better to
>> use a more specialized tool)? If so, are there any plugins for this purpose?
>>
>> Can you use Spark to weed out huge logs and extract only suspicious
>> activities, e.g. 1000 attempts to connect to a particular host within a
>> time bracket?
>>
>> Many thanks.
>> Cheers,
>> Philippe
>
>
> --
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
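For the original question (flagging, say, 1000 connection attempts to a particular host within a time bracket), the same RDD-to-DataFrame, registerTempTable and SQL pattern from the quoted example can be applied to a web access log. Below is a minimal sketch, assuming a space-delimited access log whose first field is the client address and whose fourth field starts with a timestamp like "[22/Feb/2016:22:36:01"; the path, field positions and threshold are my assumptions, not details from this thread.

// Read the (possibly gzip-compressed) access logs from HDFS; path is a placeholder
val logRdd = sc.textFile("hdfs://namenode:9000/logs/access_log-*.gz")

// Keep the client address and an hourly bucket such as "22/Feb/2016:22"
// as a crude time bracket (assumed log layout, adjust to your format)
val hits = logRdd.map(_.split(" ")).filter(_.length > 3).map { f =>
  (f(0), f(3).stripPrefix("[").take(14))
}

// spark-shell already imports sqlContext.implicits._, so toDF is available
val df = hits.toDF("client", "hour")
df.registerTempTable("access")

// Hosts with 1000 or more requests within a one-hour bracket
sql("SELECT client, hour, COUNT(1) AS hits FROM access " +
    "GROUP BY client, hour HAVING COUNT(1) >= 1000 " +
    "ORDER BY hits DESC").show()

The same per-host counting could also be done on the streaming side with the windowed count sketched earlier, if near-real-time alerting is preferred over timely batch jobs.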
