To get the logs there, you could use Flume to ship them from the web servers into HDFS, for example, and then run Spark Streaming on them.
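A rough sketch of that pipeline, using the push-based Flume receiver from the integration guide linked below, might look like the following in spark-shell. This is only an illustration: the host/port, the assumption that the client host is the first field of each log line, and the 1000-attempts threshold are placeholders I have made up, and it assumes the spark-streaming-flume artifact is on the classpath (e.g. --packages org.apache.spark:spark-streaming-flume_2.10:1.5.2).

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// 10-second micro-batches; sc is the SparkContext spark-shell provides
val ssc = new StreamingContext(sc, Seconds(10))

// Flume's Avro sink pushes events to this host/port (placeholder values)
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 4141)

// Each Flume event body is one raw log line
val lines = flumeStream.map(e => new String(e.event.getBody.array()))

// Assume the client host is the first whitespace-separated field of the line
val hostHits = lines.map(line => (line.split("\\s+")(0), 1L))

// Count hits per host over a sliding 10-minute window, evaluated every minute,
// and keep only hosts above a threshold (e.g. 1000 connection attempts)
val suspicious = hostHits
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(600), Seconds(60))
  .filter { case (_, count) => count >= 1000L }

suspicious.print()

ssc.start()
ssc.awaitTermination()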
Check these:

http://spark.apache.org/docs/latest/streaming-flume-integration.html
http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
http://www.rittmanmead.com/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/

> On 22/02/2016, at 22:36, Mich Talebzadeh <[email protected]> wrote:
>
> Hi,
>
> There are a number of options here.
>
> Your first port of call would be to store the logs that come in from the
> source in an HDFS directory as time-series entries. I assume the logs will be
> in textual format and compressed (gzip, bzip2 etc.). They can be stored
> individually and you can run timely jobs to analyse the data. I did something
> similar recently as a showcase: taking an Oracle alert log stored as a
> compressed file in HDFS, looking for ORA- errors and grouping them using
> standard SQL. In general, this can be applied to any log file.
>
> From the RDD, which I will call rdd, I will create a DataFrame called df and
> then create a temporary relational table using another method called
> registerTempTable(). That call creates an in-memory table, which I will call
> tmp, scoped to the cluster in which it was created. The data is stored using
> Hive's highly optimised, in-memory columnar format.
>
> I will use spark-shell, one of the main tools that comes with Spark, for this
> work. My Spark version is 1.5.2 and the shell language is Scala.
>
> // Looking at an Oracle log file stored in the /test directory in HDFS.
> // Create the RDD for it
>
> val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")
>
> // Convert the RDD to a DataFrame called df with a single string column
>
> val df = rdd.toDF("string")
>
> // Register this DataFrame as a temporary table called tmp
>
> df.registerTempTable("tmp")
>
> // Run standard SQL to look for '%ORA-%' errors
>
> sql("SELECT SUBSTRING(string,1,11) AS Error, count(1) AS Count FROM tmp WHERE
> string LIKE '%ORA-%' GROUP BY SUBSTRING(string,1,11) ORDER BY Count DESC
> LIMIT 10").show()
>
> +-----------+-----+
> |      Error|Count|
> +-----------+-----+
> |ORA-19815: |35798|
> |ORA-19809: |35793|
> |ORA-19804: |35784|
> |ORA-02063: |  435|
> |ORA-28500: |  432|
> |ORA-00443: |  157|
> |ORA-00312: |   22|
> |ORA-16038: |    9|
> |ORA-1652: u|    6|
> |ORA-27061: |    3|
> +-----------+-----+
>
> HTH,
>
>> Hello,
>>
>> I have a few newbie questions regarding Spark.
>>
>> Is Spark a good tool to process web logs for attacks (or is it better to
>> use a more specialized tool)? If so, are there any plugins for this purpose?
>>
>> Can you use Spark to weed out huge logs and extract only suspicious
>> activities, e.g. 1000 attempts to connect to a particular host within a
>> time bracket?
>>
>> Many thanks.
>> Cheers,
>> Philippe
>
>
> --
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
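For the original question (flagging, say, 1000 connection attempts to a particular host within a time bracket), the same RDD-to-DataFrame, registerTempTable and SQL pattern from the quoted example can be applied to a web access log. Below is a minimal sketch, assuming a space-delimited access log whose first field is the client address and whose fourth field starts with a timestamp like "[22/Feb/2016:22:36:01"; the path, field positions and threshold are my assumptions, not details from this thread.

// Read the (possibly gzip-compressed) access logs from HDFS; path is a placeholder
val logRdd = sc.textFile("hdfs://namenode:9000/logs/access_log-*.gz")

// Keep the client address and an hourly bucket such as "22/Feb/2016:22"
// as a crude time bracket (assumed log layout, adjust to your format)
val hits = logRdd.map(_.split(" ")).filter(_.length > 3).map { f =>
  (f(0), f(3).stripPrefix("[").take(14))
}

// spark-shell already imports sqlContext.implicits._, so toDF is available
val df = hits.toDF("client", "hour")
df.registerTempTable("access")

// Hosts with 1000 or more requests within a one-hour bracket
sql("SELECT client, hour, COUNT(1) AS hits FROM access " +
    "GROUP BY client, hour HAVING COUNT(1) >= 1000 " +
    "ORDER BY hits DESC").show()

The same per-host counting could also be done on the streaming side with the windowed count sketched earlier, if near-real-time alerting is preferred over timely batch jobs.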
