Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Darksu
Hello,

This is my first post, so I would like to congratulate the Spark team for the
great work!

In short, I have been studying Spark for the past week in order to create a
feasibility project.

The main goal of the project is to process text documents (each under 200
words) in order to find specific keywords.

For example, a log file can contain Error, Warn, and Info entries.

Once the keywords are found, I would like to categorize them, e.g.:

Error - Level 1
Warn  - Level 2
Info  - Level 3

The question is: what is the best approach to solve this problem?

Thanks for the help!

Best Regards,

Darksu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Feasibility-Project-Text-Processing-and-Category-Classification-tp24493.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Ritesh Kumar Singh
Load the text file as an RDD. Something like this:

  val file = sc.textFile("/path/to/file")


After this you can manipulate the RDD to filter lines the way you want:

  val a1 = file.filter( line => line.contains("[ERROR]") )
  val a2 = file.filter( line => line.contains("[WARN]") )
  val a3 = file.filter( line => line.contains("[INFO]") )


You can view the lines using the println method like this:

  a1.foreach(println)


You can also count the number of such lines using the count function like
this:

  val b1 = file.filter( line => line.contains("[ERROR]") ).count()
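
Putting the steps above together, here is a minimal sketch of the
keyword-to-level categorization the original post asks about. The level
mapping and the bracketed keyword format are assumptions for illustration,
not something the original log format guarantees:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LogLevelCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LogLevelCount").setMaster("local[*]"))

    // Hypothetical keyword-to-level mapping from the original post
    val levels = Map("[ERROR]" -> 1, "[WARN]" -> 2, "[INFO]" -> 3)

    val file = sc.textFile("/path/to/file")

    // Tag each line with the level of the first matching keyword;
    // lines with no keyword are dropped by flatMap
    val categorized = file.flatMap { line =>
      levels.collectFirst { case (kw, lvl) if line.contains(kw) => (lvl, line) }
    }

    // Count lines per level and print the totals
    categorized.countByKey().toSeq.sorted.foreach {
      case (lvl, n) => println(s"Level $lvl: $n lines")
    }

    sc.stop()
  }
}
```

Using flatMap with collectFirst keeps the pipeline to a single pass over the
file, instead of one filter-and-count job per keyword.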


Regards,


Ritesh Kumar Singh, https://riteshtoday.wordpress.com/


Re: Feasibility Project - Text Processing and Category Classification

2015-08-28 Thread Jörn Franke
I think there is already an example for this shipped with Spark. However,
you do not really benefit from any Spark functionality in this scenario.
If you want to do something more advanced, you should look at Elasticsearch
or Solr.

On Fri, Aug 28, 2015 at 16:15, Darksu nick_tou...@hotmail.com wrote:
