Thank you to you both, Jorge and Mich. You've answered my questions in a quasi-realtime manner! I will look into Flume and HDFS.
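For reference, Jorge's Flume suggestion below might look roughly like the following Spark 1.5-era Scala sketch. This is only a minimal sketch, not a tested pipeline: the application name, host, port and output path are all hypothetical, and it assumes the spark-streaming-flume artifact is on the classpath with a Flume Avro sink configured to push to that host/port.

```scala
// Minimal sketch: receive Flume events and land them on HDFS as text.
// Hostname, port and paths are hypothetical placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeLogIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeLogIngest")
    // One micro-batch per minute
    val ssc = new StreamingContext(conf, Seconds(60))

    // Push-based receiver: Flume's Avro sink sends events here
    val stream = FlumeUtils.createStream(ssc, "rhes564", 4141)

    // Decode each event body and persist every batch to HDFS as text files
    stream.map(e => new String(e.event.getBody.array()))
          .saveAsTextFiles("hdfs://rhes564:9000/test/weblogs")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Once the files are on HDFS, the batch SQL approach Mich describes below can be run over them on a schedule.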
> On 22 Feb 2016, at 22:41, Jorge Machado <[email protected]> wrote:
>
> To get the data you could use Flume to ship the logs from the servers to
> HDFS, for example, and run streaming on it.
>
> Check these:
> http://spark.apache.org/docs/latest/streaming-flume-integration.html
> http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-server-log-data/
> http://www.rittmanmead.com/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/
>
>> On 22/02/2016, at 22:36, Mich Talebzadeh
>> <[email protected]> wrote:
>>
>> Hi,
>>
>> There are a number of options here.
>>
>> Your first point of call would be to store the logs that come in from the
>> source in an HDFS directory as time-series entries. I assume the logs will
>> be in textual format and compressed (gzip, bzip2, etc.). They can be stored
>> individually and you can run timely jobs to analyse the data. I did
>> something similar recently as a showcase: taking an Oracle log file stored
>> as a compressed file in HDFS, looking for ORA- errors, and grouping them
>> using standard SQL. In general, this can be applied to any log file. From
>> the resulting RDD, which I will call rdd, I will create a DataFrame called
>> df, and then I will create a temporary relational table using another
>> method called registerTempTable(). That method call creates an in-memory
>> table, which I will call tmp, that is scoped to the cluster in which it
>> was created. If the table is cached, the data is stored using Hive's
>> highly optimized, in-memory columnar format.
>>
>> I will use spark-shell, one of the main tools that come with Spark, for
>> this work. My Spark version is 1.5.2 and the shell language here is Scala.
>>
>> //
>> // Looking at an Oracle log file stored in the /test directory in HDFS.
>> // Create the RDD for it
>> //
>>
>> val rdd = sc.textFile("hdfs://rhes564:9000/test/alert_mydb.log.gz")
>>
>> //
>> // Convert the RDD to a DataFrame called df with a single string column
>> //
>>
>> val df = rdd.toDF("string")
>>
>> //
>> // Register this DataFrame as a temporary table called tmp
>> //
>>
>> df.registerTempTable("tmp")
>>
>> //
>> // Run standard SQL to look for '%ORA-%' errors
>> //
>>
>> sql("SELECT SUBSTRING(string,1,11) AS Error, count(1) AS Count FROM tmp
>> WHERE string LIKE '%ORA-%' GROUP BY SUBSTRING(string,1,11) ORDER BY Count
>> DESC LIMIT 10").show()
>>
>> +-----------+-----+
>> |      Error|Count|
>> +-----------+-----+
>> |ORA-19815: |35798|
>> |ORA-19809: |35793|
>> |ORA-19804: |35784|
>> |ORA-02063: |  435|
>> |ORA-28500: |  432|
>> |ORA-00443: |  157|
>> |ORA-00312: |   22|
>> |ORA-16038: |    9|
>> |ORA-1652: u|    6|
>> |ORA-27061: |    3|
>> +-----------+-----+
>>
>> HTH,
>>
>>> Hello,
>>>
>>> I have a few newbie questions regarding Spark.
>>>
>>> Is Spark a good tool to process web logs for attacks (or is it better to
>>> use a more specialized tool)? If so, are there any plugins for this
>>> purpose?
>>>
>>> Can you use Spark to weed out huge logs and extract only suspicious
>>> activities, e.g. 1000 attempts to connect to a particular host within a
>>> time bracket?
>>>
>>> Many thanks.
>>>
>>> Cheers,
>>> Philippe
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> --
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only; if you are not the intended
>> recipient, you should destroy it immediately.
>> Any information in this message shall not be understood as given or
>> endorsed by Cloud Technology Partners Ltd, its subsidiaries or their
>> employees, unless expressly so stated. It is the responsibility of the
>> recipient to ensure that this email is virus free; therefore neither
>> Cloud Technology Partners Ltd, its subsidiaries nor their employees
>> accept any responsibility.
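Philippe's specific example (1000 connection attempts to a host within a time bracket) can be handled with the same temporary-table approach Mich shows above. A rough spark-shell sketch, assuming a simplified combined-log-style line format (field positions, file path, table name and the 1000 threshold are all illustrative assumptions, not from the thread):

```scala
// Sketch for spark-shell (Spark 1.5), where sc and sql are predefined and
// the implicits needed for toDF() are already imported.
// Assumes lines like (hypothetical format):
//   10.0.0.1 - - [22/Feb/2016:22:36:01 +0000] "GET /login HTTP/1.1" 401 123

case class Hit(host: String, minute: String)

// Extract the client host and a minute-level time bucket from one line;
// returns None for lines that do not match the expected shape.
def parseHit(line: String): Option[Hit] = {
  val fields = line.split(" ")
  if (fields.length > 3 && fields(3).length >= 18)
    Some(Hit(fields(0), fields(3).drop(1).take(17)))  // "22/Feb/2016:22:36"
  else
    None
}

val hits = sc.textFile("hdfs://rhes564:9000/test/access.log.gz")
             .flatMap(parseHit)
             .toDF()
hits.registerTempTable("hits")

// Hosts with 1000 or more attempts in any one-minute bracket
sql("""SELECT host, minute, count(1) AS attempts
       FROM hits
       GROUP BY host, minute
       HAVING count(1) >= 1000
       ORDER BY attempts DESC""").show()
```

The grouping key (host plus a truncated timestamp) is what turns "within a time bracket" into plain SQL; a coarser or finer bracket just means taking more or fewer characters of the timestamp.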
