Well, I am going to come out of the closet here and admit that I, for one, will sometimes want to read that machine-generated text over the human-written English one. Sometimes, to uncover the real little gems of Wikipedia, you need to have a lot of patience with the Google Translate options.
2013/4/26, Delirium <delir...@hackish.org>:
> This is a very interesting proposal. I think how well it will work may
> vary considerably based on the language.
>
> The strongest case in favor of machine-generating stubs, imo, is in
> languages where there are many monolingual speakers and the Wikipedia is
> already quite large and active. In that case, machine-generated stubs
> can help promote expansion into not-yet-covered areas, plus provide
> monolingual speakers with information they would otherwise either not
> get, or have to get in worse form via a machine-translated article.
>
> At the other end of the spectrum you have quite small Wikipedias, and
> Wikipedias which are both small and read/written mostly/entirely by
> bilingual readers. In these Wikipedias, article-writing tends to focus
> on things more specifically relevant to a certain culture and history.
> Suddenly creating tens or hundreds of thousands of stubs in such
> languages might serve to dilute a small Wikipedia more than strengthen
> it: if you take a Wikipedia with 10,000 articles, and it gains 500,000
> machine-generated stubs, *almost every* article that comes up in search
> engines will be machine-generated, making it much less obvious what
> parts of the encyclopedia are actually active and human-written amidst
> the sea of auto-generated content.
>
> Plus, from a reader's perspective, it may not even improve the
> availability of information. For example, I doubt there are many
> speakers of Bavarian who would prefer to read a machine-generated
> bar.wiki article, over a human-written de.wiki article. That may even be
> true for some less-related languages: most Danes I know would prefer a
> human-written English article over a machine-generated Danish one.
>
> -Mark
>
>
> On 4/25/13 8:16 PM, Erik Moeller wrote:
>> Millions of Wikidata stubs invade small Wikipedias .. Volapük
>> Wikipedia now best curated source on asteroids .. new editors flood
>> small wikis .. Google spokesperson: "This is out of control. We will
>> shut it down."
>>
>> Denny suggested:
>>
>>>> II ) develop a feature that blends into Wikipedia's search if an article
>>>> about a topic does not exist yet, but we have data on Wikidata about
>>>> that
>>>> topic
>> Andrew Gray responded:
>>
>>> I think this would be amazing. A software hook that says "we know X
>>> article does not exist yet, but it is matched to Y topic on Wikidata"
>>> and pulls out core information, along with a set of localised
>>> descriptions... we gain all the benefit of having stub articles
>>> (scope, coverage) without the problems of a small community having to
>>> curate a million pages. It's not the same as hand-written content, but
>>> it's immeasurably better than no content, or even an attempt at
>>> machine-translating free text.
>>>
>>> XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
>>> Vietnam]. It [grows to: 20 cm]. (pictures)
>> This seems very doable. Is it desirable?
>>
>> For many languages, it would allow hundreds of thousands of
>> pseudo-stubs (not real articles stored in the DB, but generated from
>> Wikidata) to be served to readers and crawlers that would otherwise
>> not exist in that language.
>>
>> Looking back 10 years, User:Ram-Man was one of the first to generate
>> thousands of en.wp articles from, in this case, US census data. It was
>> controversial at the time and it stuck. Other Wikipedias have since
>> then either allowed or prohibited bot-creation of articles on a
>> project-by-project basis. It tends to lead to frustration when folks
>> compare article counts and see artificial inflation by bot-created
>> content.
>>
>> Does anyone know if the impact of bot-creation on (new) editor
>> behavior has been studied? I do know that many of the Rambot articles
>> were expanded over time, and I suspect many wouldn't have been if they
>> hadn't turned up in search engines in the first place. On the flip
>> side, a large "surface area" of content being indexed by search
>> engines will likely also attract a fair bit of drive-by vandalism that
>> may not be detected because those pages aren't watched.
>>
>> A model like the proposed one might offer a solution to a lot of these
>> challenges. How I imagine it could work:
>>
>> * Templates could be defined for different Wikidata entities. We could
>> make it possible to let users add links from items in Wikidata to
>> Wikipedia articles that don't exist yet. (Currently this is
>> prohibited.) If such a link is added, _and_ a relevant template is
>> defined for the Wikidata entity type (perhaps through an entity
>> type->template mapping), WP will render an article using that
>> template, pulling structured info from Wikidata.
>>
>> * A lot of the grammatical rules would be defined in the template
>> using checks against the Wikidata result. Depending on the complexity
>> of grammatical variations beyond basics such as singular/plural this
>> might require Lua scripting.
>>
>> * The article is served as a normal HTTP 200 result, cached, and
>> indexed by search engines. In WP itself, links to the article might
>> have some special affordance that suggests that they're neither
>> ordinary red links nor existing articles.
>>
>> * When a user tries to edit the article, wikitext (or visual edit
>> mode) is generated, allowing the user to expand or add to the
>> automatically generated prose and headings. Such edits are tagged so
>> they can more easily be monitored (they could also be gated by default
>> if the vandalism rate is too high).
>>
>> * We'd need to decide whether we want these pages to show up in
>> searches on WP itself.
>>
>> Advantages:
>>
>> * These pages wouldn't inflate page counts, but they would offer
>> useful information to readers and be higher quality than machine
>> translation.
>>
>> * They could serve as powerful lures for new editors in languages that
>> are currently underrepresented on the web.
>>
>> Disadvantages/concerns:
>>
>> * Depending on implementation, I continue to have some concern about
>> {{#property}} references ending up in article text (as opposed to
>> templates); these concerns are consistent with the ones expressed in
>> the en.wp RFC [1]. This might be mitigated if Visual Editor offers a
>> super-intuitive in-place editing method. {{#property}} references in
>> text could also be converted to their plain text representation the
>> moment a page is edited by a human being (which would have its own set
>> of challenges, of course).
>>
>> * How massive would these sets of auto-generated articles get? I
>> suspect the technical complexity of setting up the templates and
>> adding the links in Wikidata itself would act as a bit of a barrier to
>> entry. But vast pseudo-article sets in tiny languages could pose
>> operational challenges without adding a lot of value.
>>
>> * Would search engines penalize WP for such auto-generated content?
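
To make the template idea above a bit more concrete: below is a minimal, hypothetical sketch of the entity type -> template mapping with one simple grammatical check, written in plain Lua since that is the scripting Erik mentions. The mock item, the property names and the renderSpeciesStub function are invented for illustration only; a real implementation would presumably be a Scribunto module reading live Wikidata claims, with the sentence patterns localised per language.

-- Hypothetical sketch only: the mock item, its property names and
-- renderSpeciesStub are made up for illustration, not real Wikidata
-- or Scribunto structures.
local item = {
  entityType = "taxon",
  label      = "XXX",
  claims     = {
    taxonRank   = "species",
    parentTaxon = "Y",                    -- [in the: Y family]
    foundIn     = { "Laos", "Vietnam" },  -- [is found in: ...]
    lengthCm    = 20,                     -- [grows to: 20 cm]
  },
}

-- Join a list with "and" before the last item: the simplest kind of
-- grammatical rule checked against the Wikidata result.
local function joinList(items)
  if #items == 1 then return items[1] end
  return table.concat(items, ", ", 1, #items - 1) .. " and " .. items[#items]
end

-- One renderer per entity type, producing Andrew's example sentence.
local function renderSpeciesStub(it)
  local c = it.claims
  local parts = {
    string.format("%s is a %s of fish in the %s family.",
                  it.label, c.taxonRank, c.parentTaxon),
  }
  if c.foundIn and #c.foundIn > 0 then
    table.insert(parts, "It is found in " .. joinList(c.foundIn) .. ".")
  end
  if c.lengthCm then
    table.insert(parts, string.format("It grows to %d cm.", c.lengthCm))
  end
  return table.concat(parts, " ")
end

-- The entity type -> template mapping from the proposal.
local templates = {
  taxon = renderSpeciesStub,
  -- asteroid = renderAsteroidStub, and so on, one renderer per type.
}

local render = templates[item.entityType]
if render then
  -- Prints: XXX is a species of fish in the Y family. It is found in
  -- Laos and Vietnam. It grows to 20 cm.
  print(render(item))
end

Anything beyond pluralisation and list joining is where the per-language grammar work starts; whether the rendered text keeps live {{#property}}-style references or is flattened to plain text the moment a human edits the page is the separate question Erik raises above.
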
>>
>> Overall, I think it's an area where experimentation is merited, as it
>> could not only expand information in languages that are
>> underrepresented on the web, but also act as a force multiplier for
>> new editor entrypoints. It also seems that a proof-of-concept for
>> experimentation in a limited context should be very doable.
>>
>> Erik
>>
>> [1]
>> https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l