Halfak created this task.
Halfak added a subscriber: Halfak.
Halfak added a project: Wikidata.
Halfak moved this task to incoming on the Wikidata workboard.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION
  **Request:** Currently the JSON dumps are compressed using gzip.  I propose 
to also provide a file compressed with bzip2. 
  
  **Reason:** I've been working with people who would like to process Wikidata 
dumps in Hadoop/Spark.  In that environment, bzip2 is better supported than 
gzip because its block compression strategy allows a file to be split and 
decompressed in parallel, whereas a gzip file must be read as a single 
stream.  Currently, we need to recompress the JSON dumps in order to take 
full advantage of these distributed processing frameworks.  It would be very 
helpful for us and our workflow if the dumps could also be provided in a 
bzip2-compressed format.
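  For reference, the recompression step we currently perform is a single 
pipe; a minimal sketch, using an illustrative filename (real Wikidata dump 
names differ):

```shell
# Create a tiny stand-in for a gzip-compressed JSON dump
# (the filename is illustrative; real Wikidata dump names differ).
echo '{"id": "Q42", "type": "item"}' | gzip -c > dump-example.json.gz

# Recompress gzip -> bzip2 in a single stream, no intermediate file.
# bzip2's block structure lets Hadoop/Spark split the result for
# parallel processing, which gzip's single-stream format prevents.
gzip -dc dump-example.json.gz | bzip2 -c > dump-example.json.bz2

# Verify the round trip.
bzip2 -dc dump-example.json.bz2
```

  Doing this server-side once would save every downstream consumer from 
repeating the full decompress/recompress pass.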
  
  See 
http://stackoverflow.com/questions/6511255/why-cant-hadoop-split-up-a-large-text-file-and-then-compress-the-splits-using-g
 and 
http://stackoverflow.com/questions/14820450/best-splittable-compression-for-hadoop-input-bz2.

TASK DETAIL
  https://phabricator.wikimedia.org/T115222

WORKBOARD
  https://phabricator.wikimedia.org/project/board/71/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Halfak
Cc: Halfak, Aklapper, Wikidata-bugs, aude



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
