[R] R Parse HTML tabular data and apply NLP

Anshuk Pal Chaudhuri Wed, 29 Jul 2015 23:56:03 -0700

Hi All,

I have quite a few files that contain HTML tabular data. The files all have 
different formats, numerous nested tables, different information, and 
completely different table structures. The only thing they have in common 
is that the data is in tables.


I was able to read the tables using the readHTMLTable function. For example, one 
file has 23 tables, and I was able to put all the data into one data frame. 
The read function cannot interpret the headers (which it is not supposed to do 
either), so it creates variables like V1, V2, and so on.
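
For illustration, a minimal sketch of the reading step described above. It 
assumes the XML package is installed; the inline HTML string and the names 
html, tables and cells are hypothetical stand-ins for one of the real files. 
It reads every table without treating the first row as a header and then 
flattens all cells into one character vector of text snippets.

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# Hypothetical inline HTML standing in for one of the files.
html <- "<html><body>
<table><tr><td>Name</td><td>Value</td></tr>
<tr><td>alpha</td><td>1</td></tr></table>
<table><tr><td>Note</td></tr><tr><td>some free text</td></tr></table>
</body></html>"

doc <- htmlParse(html, asText = TRUE)

# header = FALSE keeps the first row as data, so columns come back as V1, V2, ...
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# Flatten every cell of every table into one character vector.
cells <- unlist(lapply(tables, function(tb) unlist(tb, use.names = FALSE)),
                use.names = FALSE)

# Drop missing and empty cells.
cells <- cells[!is.na(cells) & nzchar(trimws(cells))]
```

Working on a flat vector of cells rather than on the V1, V2, ... columns 
sidesteps the fact that the column positions mean something different in 
every table.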

Now that I have all the text in a data frame (the data is scattered across 
different columns), how do I use machine learning to train a model that a given 
text (sentence, word, ...) means one thing and another text means something 
else? Basically, I want automatic categorization of all the text in the data 
frame.

I was reading about RTextTools (http://www.rtexttools.com/), but in that case 
it has to be told that this value is for this text, and so on...
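
To make the supervised route concrete, here is a rough sketch of what "telling 
it that this value is for this text" amounts to: a handful of hand-labelled 
snippets and a small bag-of-words naive Bayes classifier written in base R. 
This is not the RTextTools API; the training texts, the labels "payment" and 
"contact", and the helper names tokenize, train_nb and predict_nb are all 
hypothetical and only there for illustration.

```r
# Hypothetical hand-labelled snippets; in practice these would be table cells.
train_text  <- c("invoice total amount due", "amount paid in full",
                 "contact phone number", "email address and phone")
train_label <- c("payment", "payment", "contact", "contact")

# Lower-case and split on anything that is not a letter or digit.
tokenize <- function(x) {
  toks <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(toks, function(tk) tk[nzchar(tk)])
}

# Fit a multinomial naive Bayes model with add-one (Laplace) smoothing.
train_nb <- function(text, label) {
  toks    <- tokenize(text)
  vocab   <- sort(unique(unlist(toks)))
  classes <- sort(unique(label))
  # One column of smoothed word counts per class.
  counts <- sapply(classes, function(cl) {
    words <- unlist(toks[label == cl])
    as.numeric(table(factor(words, levels = vocab))) + 1
  })
  rownames(counts) <- vocab
  list(vocab     = vocab,
       log_prior = log(sapply(classes, function(cl) mean(label == cl))),
       log_lik   = log(sweep(counts, 2, colSums(counts), "/")))
}

# Return the most probable class for each new text.
predict_nb <- function(model, text) {
  sapply(tokenize(text), function(words) {
    words <- words[words %in% model$vocab]  # ignore words never seen in training
    score <- model$log_prior +
      colSums(model$log_lik[words, , drop = FALSE])
    names(which.max(score))
  })
}

model <- train_nb(train_text, train_label)
predict_nb(model, c("phone number of the office", "total amount"))
```

The labelled sample is the unavoidable part: any supervised classifier, 
RTextTools included, needs some cells categorized by hand before it can 
categorize the rest.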

Any help would be appreciated.

Regards,
Anshuk Pal Chaudhuri



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
