[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-27 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119970#comment-15119970
 ] 

Wang, Gang commented on SPARK-13004:


I have closed it as you advised. Thanks.

> Support Non-Volatile Data and Operations
> 
>
> Key: SPARK-13004
> URL: https://issues.apache.org/jira/browse/SPARK-13004
> Project: Spark
>  Issue Type: Epic
>  Components: Input/Output, Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Wang, Gang
>  Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have a significant negative 
> performance impact on the majority of industrial Spark workloads, especially 
> when the datasets are much larger than the cluster memory available for 
> caching, checkpointing, shuffling/dispatching, and data loading and storing. 
> JVM on-heap management likewise degrades performance when it comes under 
> pressure from large memory demands and frequent allocation/free operations 
> (a sketch of this caching trade-off follows).
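> The following is an illustrative sketch only, written against the stock Spark 
> caching API rather than our prototype, to show where this trade-off appears: 
> deserialized caching keeps many small objects on the JVM heap and drives GC, 
> while serialized caching shrinks the footprint but pays the SerDe cost on 
> every access. The input path is hypothetical.
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
> 
> val sc = new SparkContext(new SparkConf().setAppName("SerDeTradeoffSketch"))
> 
> // Hypothetical input path; any sizeable dataset shows the same pattern.
> val records = sc.textFile("hdfs:///path/to/input").map(_.split(","))
> 
> // Deserialized on-heap caching: no SerDe on access, but many small objects
> // for the garbage collector to track.
> records.persist(StorageLevel.MEMORY_ONLY)
> 
> // Serialized caching: compact byte buffers and less GC pressure, but every
> // access pays the deserialization cost described above.
> // records.persist(StorageLevel.MEMORY_ONLY_SER)
> 
> println(records.count())
> {code}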
> With the trend toward advanced server platform technologies, e.g. large-memory 
> servers, non-volatile memory, and NVMe/fast SSD array storage, this project 
> focuses on adopting new platform features for Spark applications and 
> retrofitting the use of hybrid addressable memory resources onto Spark 
> wherever possible.
> *Data Object Management*
>   * Using our non-volatile generic object programming model (NVGOP) to avoid 
> SerDe as well as to reduce GC overhead (see the sketch after this list).
>   * Minimizing the memory footprint by loading data lazily.
>   * Fitting naturally with RDD schemas for non-volatile and off-heap RDDs.
>   * Using non-volatile/off-heap RDDs to transform Spark datasets.
>   * Avoiding the in-memory caching step through in-place non-volatile RDD 
> operations.
>   * Avoiding checkpoints for Spark computations.
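> A minimal sketch of the direction these bullets point at, using only Spark's 
> existing off-heap storage level (in Spark 1.x it is backed by an external 
> block store such as Tachyon); the NVGOP-based non-volatile RDD itself lives 
> in the prototype branch linked below.
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.storage.StorageLevel
> 
> val sc = new SparkContext(new SparkConf().setAppName("OffHeapRDDSketch"))
> 
> // Hypothetical input path.
> val vectors = sc.textFile("hdfs:///path/to/vectors")
>   .map(_.split(" ").map(_.toDouble))
> 
> // Cached blocks live outside the JVM heap, so they neither add to GC pressure
> // nor have to be re-serialized when the executor heap comes under pressure.
> vectors.persist(StorageLevel.OFF_HEAP)
> 
> println(vectors.count())   // the first action materializes the off-heap cache
> {code}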
> *Data Memory Management*
>   
>   * Managing heterogeneous memory devices as a unified hybrid memory cache 
> pool for Spark.
>   * Using non-volatile memory-like devices for Spark checkpointing and shuffle 
> (see the sketch after this list).
>   * Reclaiming allocated memory blocks automatically.
>   * Providing a unified memory-block API for general-purpose memory usage.
>   
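> The sketch below shows the checkpoint/shuffle idea with plain Spark settings; 
> the /mnt/pmem paths are hypothetical mount points for a non-volatile memory or 
> fast NVMe device and should be adjusted to whatever the cluster exposes.
> {code:scala}
> import org.apache.spark.{SparkConf, SparkContext}
> 
> val conf = new SparkConf()
>   .setAppName("NonVolatileCheckpointSketch")
>   // Shuffle and spill files land under spark.local.dir, so pointing it at a
>   // persistent-memory-backed filesystem moves that traffic off slower disks.
>   .set("spark.local.dir", "/mnt/pmem/spark-local")
> val sc = new SparkContext(conf)
> 
> // Checkpointed RDDs are written here instead of being recomputed from lineage.
> sc.setCheckpointDir("/mnt/pmem/spark-checkpoints")
> 
> val rdd = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
> rdd.checkpoint()          // materialized on the non-volatile device...
> println(rdd.count())      // ...when the next action runs
> {code}
>   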
> *Computing device management*
>   * AVX instructions, programmable FPGAs, and GPUs.
>   
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
>   
> This epic aims to further improve Spark performance with our non-volatile 
> solutions. 






[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118916#comment-15118916
 ] 

Sean Owen commented on SPARK-13004:
---

OK, I don't think this JIRA itself is actionable then. I'd like to close it, 
but you can create JIRAs for specific targeted changes instead. However, if they 
are large or invasive, you probably want to propose discussion on dev@ first 
before you go to the trouble.







[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-26 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118288#comment-15118288
 ] 

Wang, Gang commented on SPARK-13004:


Yes, that is one of our prototypes for a proof of concept. We are preparing to 
propose some specific changes to Spark, e.g. non-volatile checkpointing and 
non-volatile caching and storage, all of which would work with the non-volatile 
RDD. Thanks.







[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-26 Thread Yanping Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118357#comment-15118357
 ] 

Yanping Wang commented on SPARK-13004:
--

Hi Sean, in order for this model to work, we also developed a memory library to 
support this computation model. We are planning to donate this library to the 
Apache Incubator. If you are interested, I can send you a proposal draft. 







[jira] [Commented] (SPARK-13004) Support Non-Volatile Data and Operations

2016-01-26 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118468#comment-15118468
 ] 

Wang, Gang commented on SPARK-13004:


Yes, that is one of our prototypes for a proof of concept. We are preparing to 
propose some specific changes to Spark, e.g. non-volatile checkpointing and 
non-volatile caching and storage, all of which would work with the non-volatile 
RDD. Thanks.



