Re: [Wikitech-l] require language dump for developing words and corresponding frequency

[email protected] Tue, 14 Dec 2010 01:18:46 -0800

On 14 December 2010 14:28, Andrew Dunbar <[email protected]> wrote:


> The dump site (http://download.wikimedia.org/) is still broken at the
> moment but another way to build some word frequency data is by
> randomly sampling the wikis for the languages you are interested in.
> At least these Indic languages have Wikipedias of varying sizes:
>
> Assamese http://as.wikipedia.org
> Bihari http://bh.wikipedia.org
> Bengali http://bn.wikipedia.org
> Bishnupriya Manipuri http://bpy.wikipedia.org
> Gujarati http://gu.wikipedia.org
> Hindi http://hi.wikipedia.org
> Kannada http://kn.wikipedia.org
> Kashmiri http://ks.wikipedia.org
> Marathi http://mr.wikipedia.org
> Nepali http://ne.wikipedia.org
> Nepal Bhasa http://new.wikipedia.org
> Oriya http://or.wikipedia.org/wiki
> Eastern Punjabi http://pa.wikipedia.org
> Western Punjabi http://pnb.wikipedia.org
> Sanskrit http://sa.wikipedia.org
> Sindhi http://sd.wikipedia.org
> Tamil http://ta.wikipedia.org
> Telugu http://te.wikipedia.org
> Urdu http://ur.wikipedia.org
>
> If you'd like to use it I have a tool that downloads random samples of
> wiki pages and strips the HTML for purposes such as this.
>


Yes, please let me know; that would be very useful.
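
For reference, the sampling approach described above could be sketched
roughly as follows. This is not Andrew's tool, only a minimal
illustration of the idea. It assumes the standard MediaWiki API
(list=random and action=parse) is reachable on each language's
Wikipedia; the function names, the User-Agent string and the
tokenization rule are my own assumptions. Words are built from Unicode
letters and combining marks, because a plain \w+ regex would break
Indic words at vowel signs.

```python
import json
import unicodedata
import urllib.parse
import urllib.request
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent block elements apart.
    return " ".join(parser.parts)


def tokenize(text):
    """Split text into words made of letters (L*) and marks (M*).

    Zero-width joiner/non-joiner are kept because Indic scripts use them
    inside words. Digits and punctuation act as separators (an assumption).
    """
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LM" or ch in "\u200c\u200d":
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def word_frequencies(texts):
    """Count word occurrences over an iterable of plain-text strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def api_get(lang, **params):
    """Call the MediaWiki API of <lang>.wikipedia.org and decode the JSON."""
    params["format"] = "json"
    url = "https://%s.wikipedia.org/w/api.php?%s" % (
        lang, urllib.parse.urlencode(params))
    # Hypothetical User-Agent; replace with something identifying yourself.
    req = urllib.request.Request(
        url, headers={"User-Agent": "word-freq-sketch/0.1 (example)"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def random_page_html(lang, count=10):
    """Yield rendered HTML for `count` random main-namespace pages."""
    data = api_get(lang, action="query", list="random",
                   rnnamespace=0, rnlimit=count)
    for page in data["query"]["random"]:
        parsed = api_get(lang, action="parse", pageid=page["id"], prop="text")
        yield parsed["parse"]["text"]["*"]
```

With this in place, something like
word_frequencies(strip_html(h) for h in random_page_html("hi", 20)).most_common(50)
should give a first rough list for Hindi. The rendered HTML still
contains navigation and template text, so the counts will be noisy
until that is filtered out.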

Thanks,
Pravin s
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
