Reynold Xin created SPARK-2496:
----------------------------------

             Summary: Compression streams should write its codec info to the 
stream
                 Key: SPARK-2496
                 URL: https://issues.apache.org/jira/browse/SPARK-2496
             Project: Spark
          Issue Type: Improvement
            Reporter: Reynold Xin
            Priority: Critical


Spark sometimes stores compressed data outside of Spark (e.g. event logs, blocks 
in Tachyon), and that data is read back directly using the codec configured 
by the user. When the configured codec differs between runs, Spark won't be 
able to read the data back. 
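
A minimal sketch of the failure mode, assuming Python's standard-library zlib and bz2 modules as stand-ins for two different Spark codecs (the payload and variable names are hypothetical, not Spark code):

```python
import bz2
import zlib

payload = b"event log line\n" * 100

# First run: the user configured zlib (stand-in for one codec).
stored = zlib.compress(payload)

# Second run: the user configured bz2 (stand-in for a different codec).
# The stored bytes carry no hint about which codec produced them,
# so the reader blindly applies the currently configured one.
try:
    bz2.decompress(stored)
    readable = True
except (OSError, ValueError):
    readable = False

print("readable with the new codec:", readable)
```

The data is still intact and decodes fine with the original codec; the reader simply has no way to know which one that was.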

I'm not sure what the best strategy here is yet. If we write the codec 
identifier for all streams, then we will be writing a lot of identifiers for 
shuffle blocks. One possibility is to only write it for blocks that will be 
shared across different Spark instances (i.e. managed outside of Spark), which 
includes Tachyon blocks and event log blocks.
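
One way the selective-header idea could look, as a rough sketch only. The one-byte identifiers, the `shared` flag, and the `write_block` / `read_block` helpers are all assumptions made for illustration, and standard-library codecs stand in for Spark's:

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte codec identifiers; not an actual Spark format.
CODECS = {
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}


def write_block(data: bytes, codec_id: int, shared: bool) -> bytes:
    """Compress data; prefix the codec id only for externally shared blocks."""
    compress, _ = CODECS[codec_id]
    body = compress(data)
    # Shuffle-style blocks (shared=False) skip the header to avoid overhead.
    return bytes([codec_id]) + body if shared else body


def read_block(stored: bytes, configured_codec_id: int, shared: bool) -> bytes:
    """Decompress a block, trusting the embedded id over the configured codec."""
    if shared:
        codec_id, body = stored[0], stored[1:]
    else:
        codec_id, body = configured_codec_id, stored
    _, decompress = CODECS[codec_id]
    return decompress(body)


data = b"x" * 1000
event_log = write_block(data, codec_id=1, shared=True)
# A later run configured with codec 2 can still read the block written with 1.
restored = read_block(event_log, configured_codec_id=2, shared=True)
```

In this sketch the cost is one byte per shared block, while internal blocks stay byte-for-byte the same as today and keep relying on the configured codec.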





--
This message was sent by Atlassian JIRA
(v6.2#6252)
