Hi again,

After reading https://apacheignite.readme.io/docs/memory-configuration and https://apacheignite.readme.io/docs/evictions I have been able to configure eviction and a maximum size through a DataStorageConfiguration with a DataRegionConfiguration that I have associated with the IGFS dataCacheConfiguration. I understand that configuring eviction and expiry through CacheConfiguration only makes sense for on-heap caches, which I guess was the only option available at the time the book was written.

However, some of my previous questions still apply, and I have a couple of additional ones:
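For reference, the data-region setup I'm describing is roughly the following Spring XML (the region name, size and eviction mode are just the values I'm testing with, not recommendations):

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
  <property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
      <property name="dataRegionConfigurations">
        <list>
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <!-- Region referenced from the IGFS dataCacheConfiguration
                 via its dataRegionName property. -->
            <property name="name" value="igfsDataRegion"/>
            <!-- 4 GB cap for the region's off-heap memory. -->
            <property name="maxSize" value="#{4L * 1024 * 1024 * 1024}"/>
            <property name="pageEvictionMode" value="RANDOM_2_LRU"/>
            <!-- The persistence question below refers to enabling this: -->
            <!-- <property name="persistenceEnabled" value="true"/> -->
          </bean>
        </list>
      </property>
    </bean>
  </property>
</bean>
```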
- When using IGFS with a secondary file system, the main storage is the secondary file system, but I might be interested in using Ignite persistence so I can cache data not only in the memory of the Ignite workers but also on their disks. For example, I might have a data lake in a huge HDFS cluster and a separate, smaller compute cluster where I want to run Spark and cache the data stored in the other HDFS by using IGFS. If I enable persistence for the data region used by the IGFS dataCacheConfiguration, will evicted data be deleted from the disks of the Ignite servers, or only from memory? In this case I would like it to be deleted from disk as well, because the intention is to use the disk as tiered storage for IGFS, understood as a cache of the external HDFS cluster. Otherwise the disks might fill up, since the HDFS cluster is much bigger than the compute cluster where IGFS is running.

- Does IGFS synchronize the eviction of entries in the data and metadata caches, even if I use two different data regions for the two caches? A metadata entry with no data entries can be useful, but not the other way around.

- Is there any recommended ratio between the page size of the DataStorageConfiguration/DataRegionConfiguration used for the IGFS dataCacheConfiguration and the block size configured for IGFS?

Thanks again for all your help.

Best Regards,

Juan

On Tue, Dec 12, 2017 at 6:38 PM, Juan Rodríguez Hortalá <
[email protected]> wrote:

> Hi,
>
> I'm trying to understand the configuration parameters for IGFS. My use
> case is using IGFS with a secondary file system, thus acting as a cache for
> a Hadoop file system, without having to modify any existing application
> (just the input and output paths, which will now use the igfs scheme). In
> the javadoc for FileSystemConfiguration I see:
>
> int getPerNodeBatchSize()
> Gets number of file blocks buffered on local node before sending batch to
> remote node.
> int getPerNodeParallelBatchCount()
> Gets number of batches that can be concurrently sent to remote node.
>
> int getPrefetchBlocks()
> Get number of pre-fetched blocks if specific file's chunk is requested.
>
> What is the remote node here? I understand this doesn't have to do with
> other Ignite nodes holding backup copies, as that would be set in the cache
> configuration.
>
> I have also taken a look at
> http://apache-ignite-users.70518.x6.nabble.com/IGFS-Data-cache-size-td2875.html
> but that post seems to refer to a deprecated field
> FileSystemConfiguration.maxSpaceSize that I haven't been able to find either
> in the javadoc or in
> https://github.com/apache/ignite/blob/2.3.0/modules/core/src/main/java/org/apache/ignite/configuration/FileSystemConfiguration.java.
>
> Other questions that I have regarding Ignite configuration in the context
> of this use case:
>
> - When I use ATOMIC for the atomicityMode of metaCacheConfiguration I get
> a launch exception "Failed to start grid: IGFS metadata cache should be
> transactional: igfs". So I understand TRANSACTIONAL is required for
> metaCacheConfiguration, but I get no error when using ATOMIC for
> dataCacheConfiguration. Is there any reason to use TRANSACTIONAL for
> dataCacheConfiguration? I understand ATOMIC gives better performance if you
> don't use the transaction features.
>
> - Do the readThrough, writeThrough and writeBehind fields of the
> CacheConfiguration in dataCacheConfiguration and metaCacheConfiguration
> have any effect? Or maybe IGFS sets them according to the IgfsMode
> configured in the defaultMode field of FileSystemConfiguration?
>
> - Similarly, does the setExpiryPolicyFactory in dataCacheConfiguration and
> metaCacheConfiguration have any effect? I'd be interested in using the
> DUAL_ASYNC defaultMode, and I thought that maybe the ExpiryPolicy could
> give an upper bound on the time it takes for a record to be written to the
> secondary file system, because it has been expired from the cache.
> That way I could safely tear down the IGFS cluster after that time without
> any data loss. Is there some way of achieving that? Otherwise I think
> DUAL_ASYNC could only be used in long-lived clusters, because I understand
> there is no functionality to flush the IGFS caches to the secondary file
> system.
>
> - Similarly, does the eviction policy configured for dataCacheConfiguration
> and metaCacheConfiguration have any effect? In any case, I understand that
> IGFS can never fail due to running out of space in the caches, because it
> will evict the required entries, saving them to the secondary file system
> if needed in order to avoid data loss.
>
> It would be nice if someone could point me to a webinar or documentation
> specific to IGFS. I have already watched
> https://www.youtube.com/watch?v=pshM_gy7Wig and I think it is a good
> introduction, but I would like to get more details. I have also read the
> book "High-Performance In-Memory Computing With Apache Ignite".
>
> Thanks a lot for all your help.
>
> Best Regards,
>
> Juan
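P.S. In case it helps, this is roughly the FileSystemConfiguration cache setup I'm experimenting with while testing the atomicity modes mentioned in my original mail above (the IGFS name, region name and everything else here are just my example values, not recommendations):

```xml
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  <property name="name" value="igfs"/>
  <property name="defaultMode" value="DUAL_ASYNC"/>
  <!-- TRANSACTIONAL is required here; with ATOMIC the node fails to start
       with "Failed to start grid: IGFS metadata cache should be
       transactional: igfs". -->
  <property name="metaCacheConfiguration">
    <bean class="org.apache.ignite.configuration.CacheConfiguration">
      <property name="atomicityMode" value="TRANSACTIONAL"/>
    </bean>
  </property>
  <!-- ATOMIC starts fine for the data cache; whether TRANSACTIONAL is
       ever needed here is one of my questions above. -->
  <property name="dataCacheConfiguration">
    <bean class="org.apache.ignite.configuration.CacheConfiguration">
      <property name="atomicityMode" value="ATOMIC"/>
      <property name="dataRegionName" value="igfsDataRegion"/>
    </bean>
  </property>
</bean>
```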
