On 14 December 2010 14:28, Andrew Dunbar <[email protected]> wrote:
> The dump site (http://download.wikimedia.org/) is still broken at the
> moment but another way to build some word frequency data is by
> randomly sampling the wikis for the languages you are interested in.
> At least these Indic languages have Wikipedias of varying sizes:
>
> Assamese http://as.wikipedia.org
> Bihari http://bh.wikipedia.org
> Bengali http://bn.wikipedia.org
> Bishnupriya Manipuri http://bpy.wikipedia.org
> Gujarati http://gu.wikipedia.org
> Hindi http://hi.wikipedia.org
> Kannada http://kn.wikipedia.org
> Kashmiri http://ks.wikipedia.org
> Marathi http://mr.wikipedia.org
> Nepali http://ne.wikipedia.org
> Nepal Bhasa http://new.wikipedia.org
> Oriya http://or.wikipedia.org/wiki
> Eastern Punjabi http://pa.wikipedia.org
> Western Punjabi http://pnb.wikipedia.org
> Sanskrit http://sa.wikipedia.org
> Sindhi http://sd.wikipedia.org
> Tamil http://ta.wikipedia.org
> Telugu http://te.wikipedia.org
> Urdu http://ur.wikipedia.org
>
> If you'd like to use it I have a tool that downloads random samples of
> wiki pages and strips the HTML for purposes such as this.
>

Yeah, let me know, that will be very useful.

Thanks,
Pravin s
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
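
For reference, a minimal Python sketch of the sampling approach described in the quoted message. This is not Andrew's tool, whose details are not given in this thread. The function names (random_pages_url, tokenize, word_frequencies) are hypothetical, and the sketch assumes the page text has already been fetched and stripped of HTML. It builds a URL for the MediaWiki API's list=random query and counts words in plain text. Combining marks and the zero-width joiners are kept inside words, since Indic scripts rely on them and a plain \w+ regex in Python would split words at vowel signs.

```python
import unicodedata
from collections import Counter
from urllib.parse import urlencode

# Zero-width non-joiner / joiner, which can occur inside Indic words.
JOINERS = "\u200c\u200d"


def random_pages_url(lang, limit=10):
    """Build a MediaWiki API URL that lists `limit` random article titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # main (article) namespace only
        "rnlimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def tokenize(text):
    """Split text into words made of letters, marks, digits and joiners."""
    words, current = [], []
    for ch in text:
        # Unicode general categories starting with L (letter),
        # M (combining mark) or N (number) count as word characters.
        if unicodedata.category(ch)[0] in "LMN" or ch in JOINERS:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def word_frequencies(texts):
    """Count word occurrences across an iterable of plain-text pages."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


if __name__ == "__main__":
    print(random_pages_url("hi", 5))
    sample_pages = ["हिन्दी भाषा, हिन्दी!", "भाषा और हिन्दी"]
    for word, count in word_frequencies(sample_pages).most_common(3):
        print(word, count)
```

Fetching the sampled pages and stripping their markup is left out on purpose, since how best to do that depends on the tool and on how much load is acceptable for the wiki in question.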
