Hello Quinn / SuperGrey,

Here is my advice:

1. Select a few documents (1, 5, or 10) that are especially interesting and that, ideally, you can relate to Wikipedia articles.
2. Upload those documents to Wikimedia Commons.
3. Mirror and format them in Wikisource.
4. Figure out how to do court citations. For English-language documents this is very hard, and I have no experience with Chinese court citations. Get your citation data into Wikidata as part of https://meta.wikimedia.org/wiki/WikiCite
5. Interconnect everything (Wikipedia, Wikimedia Commons, Wikidata, and Wikisource) in the usual wiki ways.
6. Then come back and ask here again about doing this for 1000 more documents.

You said you have tens of millions. The Wikimedia platform is not an exhaustive archive, and we probably only want documents which can be of general interest, but it is a good place for you to sort out your process and showcase a select set of important documents, whether that is 10s, 100s, or 1000s of them, or whatever is interesting. On the Wikimedia platform you will also be able to develop a general-use data model for organizing these documents, for if and when you or anyone else finds or creates an appropriate complete archive.

An appropriate on-wiki place for your data modeling discussion is https://meta.wikimedia.org/wiki/Talk:WikiCite . There is an active WikiCite community, and your project is a sort of document metadata sorting project, but we have never done legal documents there, nor do we have much Chinese-language document curation.

I think you have an interesting project, and I would like to see you get at least 1 document into the Wikimedia platform as a demonstration.

yours

On Mon, Dec 8, 2025 at 5:09 AM Gerard Meijssen <[email protected]> wrote:

> Hoi,
> I wonder if this information is available at archive.org. If it is, having it at wikisource is somewhat redundant.
> Thanks,
> GerardM
>
> On Sat, 6 Dec 2025 at 12:11, <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I am Quinn (User:SuperGrey) from Chinese Wikisource (zh.wikisource.org). I am writing to request advice and precedent from the wider Wikisource community and the Wikimedia Foundation regarding a proposed large-scale import of Chinese court judgments from the national database known as China Judgments Online (中国裁判文书网, often abbreviated as CJO).
>>
>> I would like to begin with some background, because many non-Chinese Wikimedia contributors may not be aware of how significant CJO has been for judicial transparency in China and how sharply access to it has been reduced in recent years.
>>
>> China Judgments Online was launched in 2014 by the Supreme People’s Court (SPC) as a major transparency initiative. For nearly a decade, courts across the country uploaded tens of millions of decisions, creating what was widely regarded as one of the world’s largest publicly accessible judicial databases. At its peak, CJO hosted over 140 million documents and received tens of billions of page views. Researchers inside and outside China used the site extensively to study judicial behavior, local governance, criminal justice, and institutional changes.
>>
>> However, since around 2021, and especially in 2023–2024, the Chinese government has significantly reversed this openness. Multiple independent investigations and media reports have documented the systematic removal of previously public judgments, particularly those that reflect poorly on local authorities, expose procedural misconduct, involve politically sensitive issues, or contradict preferred political narratives. In late 2023, leaked SPC documents revealed instructions to migrate judgments into a new internal-only database accessible solely within the court system, while sharply reducing what remains publicly visible. Studies have shown that vast numbers of cases have already disappeared from public view. Major news organizations such as MIT Technology Review, Radio Free Asia, the South China Morning Post, and Reuters have all reported on this rollback of judicial transparency:
>> – https://www.technologyreview.com/2023/12/20/1085741/china-judgements-online-transparency-government/
>> – https://www.rfa.org/english/news/china/china-court-records-12142023132626.html
>> – https://www.scmp.com/news/china/politics/article/3246067/china-cut-back-access-court-rulings-sparking-concerns-about-judicial-transparency
>> – https://www.reuters.com/world/china/china-vows-judicial-disclosure-after-outcry-over-plan-curb-access-rulings-2024-01-22/
>>
>> For our purposes, the important point is this: CJO has removed or restricted access to large portions of its historical archive, including documents that were originally public, legally non-copyrightable under Chinese law, and crucial for understanding the functioning of China’s legal system. Many judgments that were once easily verifiable on the official site can no longer be checked against their original source. These documents are at risk of disappearing entirely from public access.
>>
>> An independent archiving project, caseopen.org, has preserved a large HTML snapshot of CJO’s judgments spanning 2013 to October 2024. The maintainers of caseopen.org have donated this dataset to Chinese Wikisource. The files capture the “online version” as it originally appeared on CJO, including formatting and errors, and therefore represent a unique opportunity to preserve a historical record of China’s legal system prior to this wave of censorship and delisting. In practical terms, this may be the last comprehensive public snapshot that will ever exist.
>>
>> On Chinese Wikisource, I have proposed importing this dataset through a bot (User:SuperGrey-bot). The local discussion, including technical details and code links, is here (in Chinese):
>> https://zh.wikisource.org/wiki/Wikisource:机器人#User:SuperGrey-bot
>>
>> The scale of the corpus is extremely large: tens of millions of judgments, potentially more if we include non-judgment document types such as 裁定书 (ruling document) and 通知书 (notification document). We are planning a staged import, beginning with small test batches, then individual months, and only later the full corpus, once the community settles questions about formatting, titling, metadata, and scope.
>>
>> Because this project includes politically sensitive material and an unusual archival value, and because the scale is unprecedented for our language Wikisource, I would greatly appreciate advice and precedent from the international community. This is not only a technical or organizational task; it is also a preservation effort. We are attempting to safeguard public domain legal documents that have been systematically removed from public access. Wikisource may be one of the last neutral, open, global platforms capable of preserving this historical record.
>>
>> Given the potential size of the import, I would also appreciate input from the Wikimedia Foundation on any operational considerations. A multi-million–page import may affect storage, dumps, CirrusSearch indexing, and overall site performance. Before proceeding beyond small test batches, I would like to understand whether such an import is feasible within the current technical limits of Chinese Wikisource, and whether coordination with SRE or Cloud Services is recommended.
>>
>> Specifically, I would like to ask for input on the following areas:
>>
>> 1. Scope and suitability
>> Have other Wikisources hosted similarly massive, uniform corpora of government or legal documents? How did you determine whether they fit the mission of Wikisource? Were there concerns about overwhelming the project or changing its character?
>>
>> 2. Verifiability and provenance
>> In our case, the source is an independent mirror of a government website that is now selectively removing documents. While Wikimedia projects have long preserved public domain government documents after originals were taken down or censored, I am unsure how Wikisource communities have handled this scenario in practice. Are mirrored datasets acceptable when the original public source has been altered or removed? How should we document provenance and authenticity for future readers?
>>
>> 3. Organizational and technical considerations
>> If we proceed, how should we structure this corpus so the project remains usable? Are there recommended practices for:
>> – titling, metadata, and Wikidata integration for legal documents,
>> – organizing millions of pages so they do not overwhelm categories and search,
>> – mitigating strain on job queues, dumps, and indexing,
>> – making future partial deletions or corrections feasible if political pressure or legal demands (e.g., DMCA takedown notices) ever arise?
>>
>> 4. Political and archival importance
>> Wikisource has historically preserved documents at risk of censorship or disappearance, whether due to authoritarian restrictions or institutional neglect. Do other communities have experience with politically sensitive archival projects where the preservation value itself was a central motivation?
>>
>> At present, Chinese Wikisource is still deliberating basic formatting and policy questions. No large imports will be performed until a local consensus is clear. Although we are working from the independent caseopen.org snapshot rather than relying on ongoing availability of the official CJO site, the broader context is that public access to Chinese judicial decisions has already been substantially reduced in recent years. Because our dataset preserves a historical record that may not remain accessible through official channels, we believe this is an appropriate moment to seek broader input and learn from other Wikisource communities with similar archival experiences.
>>
>> Thank you very much for your time, advice, and any examples or concerns you can share. Even understanding which questions we should be asking would be extremely helpful.
>>
>> Best regards,
>> Quinn Gao (User:SuperGrey)
>> https://meta.wikimedia.org/wiki/User:SuperGrey
>> _______________________________________________
>> Wikisource-l mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
>>
> _______________________________________________
> Wikisource-l mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>

--
Lane Rasberry
user:bluerasberry
🟦🌀💙🌀🟦
