Hi,

Thanks for your reply. I have got the table of contents, meta-data, title, author, etc. for the books. Can you please tell me the next steps to proceed? I have read in the Mahout In Action book that there are a few tools available for vectorization, e.g. Lucene analyzers and the Mahout vector encoders. Can you please tell me which is better and how to use it?
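For reference, the core idea behind Mahout's vector encoders is feature hashing: a token is hashed straight into a slot of a fixed-length vector instead of being looked up in a dictionary. Below is a plain-Java toy sketch of that idea only; it is my own illustration, not the Mahout API (Mahout's actual encoders live in the org.apache.mahout.vectorizer.encoders package):

```java
import java.util.Arrays;

/**
 * Toy illustration of feature hashing, the idea behind Mahout's
 * vector encoders: each token is hashed into a slot of a
 * fixed-length double[] instead of a dictionary lookup.
 */
public class HashingEncoderSketch {

    static double[] encode(String text, int dim) {
        double[] vector = new double[dim];
        // crude tokenization: lowercase, split on non-letter runs
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            int slot = Math.floorMod(token.hashCode(), dim);
            vector[slot] += 1.0; // bump the slot this token hashes to
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("Classification of books, books about topics", 16);
        System.out.println(Arrays.toString(v));
    }
}
```

A real Lucene analyzer would replace the crude `split` call above with proper tokenization, stop-word removal and stemming; the hashing step itself is what makes the vectors fixed-length without a dictionary.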
Thanks,
Suresh

On 16 January 2014 14:49, Saeed Iqbal KhattaK <[email protected]> wrote:

> Dear Suresh,
>
> I am also working on classification of books.
>
> First of all I collect the meta-data of my e-books. After collecting the
> meta-data I start the second stage, pre-processing an e-book. In
> pre-processing I extract information such as the *book title, chapter
> titles, sections, subsections, paragraphs, sub-paragraphs and bold fonts*
> etc., remove all other formatting, and keep the result.
>
> On Thu, Jan 16, 2014 at 2:09 PM, Ted Dunning <[email protected]> wrote:
>
> > You generally want to do linguistic pre-processing (finding phrases,
> > synonymizing certain forms such as abbreviations, tokenizing, dropping
> > stop words, removing boilerplate, removing tables) before doing
> > vectorization. Altogether, these form pre-processing.
> >
> > To classify books, you need to recognize that many books are about many
> > topics. You may want to segment your books down to the chapter, section
> > or even paragraph level.
> >
> > On Wed, Jan 15, 2014 at 10:25 PM, Suresh M <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Can you please tell me what that pre-processing means? Is it
> > > vectorization (as explained in the Mahout in Action book)? Can it be
> > > done using Java and the Mahout API? And does "the model" mean a class?
> > >
> > > On 16 January 2014 11:38, KK R <[email protected]> wrote:
> > >
> > > > Hi Suresh,
> > > >
> > > > Apache Mahout has several classification algorithms which you can
> > > > use to do the classification.
> > > >
> > > > Step 1: Your data may require pre-processing. If so, it can be done
> > > > using Hadoop / Hive / Mahout utilities.
> > > >
> > > > Step 2: Run a classification algorithm on your training data and
> > > > build your model using Mahout classification algorithms.
> > > >
> > > > Step 3: When the actual data comes, it needs to be classified with
> > > > the help of the trained model. This can be done sequentially in
> > > > Java, or MapReduce can be used if the size of the data is huge and
> > > > scalability is a requirement.
> > > >
> > > > Thanks,
> > > > Kirubakumaresh
> > > > http://www.linkedin.com/pub/kirubakumaresh-rajendran/66/411/305
> > > >
> > > > On Thu, Jan 16, 2014 at 11:28 AM, Suresh M <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > > Our application will be getting books from different users.
> > > > > We have to classify them accordingly.
> > > > > Can someone please tell me how to do that using Apache Mahout and
> > > > > Java? Is Hadoop necessary for that?
> > > > >
> > > > > --
> > > > > Thanks & Regards,
> > > > > Suresh
>
> --
> *Saeed Iqbal KhattaK*
> Lecturer (FoIT) -- University of Central Punjab, Lahore
> Tel: +92-42-35880007 - (ext 194)
> MS CS, FAST-NUCES, Peshawar
> BS IT (Hons), Punjab University College of Information Technology (PUCIT),
> University Of The Punjab, Lahore.
> http://saeedkhattak.wordpress.com
> Cell No # +92-333-9533493
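To make the three steps Kirubakumaresh describes above concrete, here is a minimal end-to-end sketch in plain Java, with a toy nearest-centroid classifier standing in for Mahout's real algorithms (in an actual pipeline, a Mahout trainer such as its SGD logistic regression would replace the toy train/classify methods; all names and data here are my own invention for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal end-to-end sketch of the three steps above, with a toy
 * nearest-centroid classifier standing in for Mahout's algorithms.
 */
public class BookClassifierSketch {

    // Step 1: pre-process a document into a term-count map
    static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z]+"))
            if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // Step 2: "train" by summing the term counts of each label's documents
    static Map<String, Map<String, Integer>> train(Map<String, List<String>> labeled) {
        Map<String, Map<String, Integer>> centroids = new HashMap<>();
        labeled.forEach((label, docs) -> {
            Map<String, Integer> c = new HashMap<>();
            for (String d : docs)
                vectorize(d).forEach((t, n) -> c.merge(t, n, Integer::sum));
            centroids.put(label, c);
        });
        return centroids;
    }

    // Step 3: classify a new document by term overlap with each centroid
    static String classify(Map<String, Map<String, Integer>> model, String doc) {
        Map<String, Integer> v = vectorize(doc);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : model.entrySet()) {
            int score = 0;
            for (Map.Entry<String, Integer> t : v.entrySet())
                score += t.getValue() * e.getValue().getOrDefault(t.getKey(), 0);
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> training = Map.of(
            "cooking", List.of("recipes and baking bread", "soup recipes"),
            "programming", List.of("java and hadoop code", "mapreduce java jobs"));
        Map<String, Map<String, Integer>> model = train(training);
        System.out.println(classify(model, "a book of bread recipes")); // -> cooking
    }
}
```

Hadoop only enters the picture at step 3 (and for training on large corpora): the same classify step can run as a MapReduce job when the volume of incoming books makes sequential Java too slow.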
