Feasibility Project - Text Processing and Category Classification
Hello,

This is my first post, so I would like to congratulate the Spark team for the great work! In short, I have been studying Spark for the past week in order to create a feasibility project. The main goal of the project is to process text documents (word count will not be over 200 words) in order to find specific keywords; e.g. a log file can contain Error, Warn, and Info. Once the keywords are found, I would like to categorize them, e.g.:

    Error - Level 1
    Warn  - Level 2
    Info  - Level 3

What is the best approach to solve this problem?

Thanks for the help!

Best Regards,
Darksu

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Feasibility-Project-Text-Processing-and-Category-Classification-tp24493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: Feasibility Project - Text Processing and Category Classification
Load the text file as an RDD. Something like this:

    val file = sc.textFile("/path/to/file")

After this you can manipulate the RDD to filter the lines the way you want:

    val a1 = file.filter(line => line.contains("[ERROR]"))
    val a2 = file.filter(line => line.contains("[WARN]"))
    val a3 = file.filter(line => line.contains("[INFO]"))

You can print the matching lines using the println method:

    a1.foreach(println)

You can also count the number of such lines using the count method:

    val b1 = file.filter(line => line.contains("[ERROR]")).count()

Regards,
Ritesh Kumar Singh
https://riteshtoday.wordpress.com/
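Building on the filters above, the keyword-to-level mapping from the original question can be sketched as a plain function applied per line. This is only a sketch: `categorize` is a hypothetical helper name, and the level numbers simply follow the Error = 1 / Warn = 2 / Info = 3 scheme from the post.

```scala
// Hypothetical helper: map a log line to a level, following the scheme
// from the original post (Error -> 1, Warn -> 2, Info -> 3).
// Returns 0 when no keyword matches.
def categorize(line: String): Int =
  if (line.contains("[ERROR]")) 1
  else if (line.contains("[WARN]")) 2
  else if (line.contains("[INFO]")) 3
  else 0

// In Spark this could be applied to each line of the RDD, e.g.:
//   val levels = file.map(line => (line, categorize(line)))
```

Pairing each line with its level (rather than keeping three separate filtered RDDs) makes it easy to group or count by category later, e.g. with `countByValue` on the levels.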
Re: Feasibility Project - Text Processing and Category Classification
I think there is already an example for this shipped with Spark. However, you do not really benefit from any Spark functionality for this scenario. If you want to do something more advanced, you should look at Elasticsearch or Solr.

On Fri, Aug 28, 2015 at 16:15, Darksu <nick_tou...@hotmail.com> wrote:
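As a footnote to the reply above: for documents of at most 200 words, the keyword search does not need Spark at all. A minimal plain-Scala sketch, with an assumed keyword list and an assumed sample document, could count occurrences like this:

```scala
// Minimal non-Spark sketch: count keyword occurrences in one small document.
// The keyword list and document text below are illustrative assumptions.
val keywords = Seq("Error", "Warn", "Info")
val doc = "Error while reading file. Warn: retrying. Info: done. Error again."

// Split on non-word characters so punctuation does not stick to tokens,
// then count exact token matches per keyword.
val tokens = doc.split("\\W+")
val counts: Map[String, Int] =
  keywords.map(k => k -> tokens.count(_ == k)).toMap
```

Spark starts to pay off when the corpus is many such documents distributed across a cluster; for a single small file, plain collections are simpler and faster.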