Hi 

You might have a look at Fuel: https://github.com/mila-udem/fuel
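If you go the Fuel route, streaming shuffled minibatches straight out of an
HDF5 file looks roughly like the sketch below. I am writing this from memory,
so double-check the class names and arguments against the Fuel docs; the file
name 'dataset.hdf5' and the batch size are placeholders.

from fuel.datasets.hdf5 import H5PYDataset
from fuel.streams import DataStream
from fuel.schemes import ShuffledScheme

# open only the 'train' split; the data stays on disk, not in memory
train_set = H5PYDataset('dataset.hdf5', which_sets=('train',))

# shuffled minibatches of 128 examples per epoch
stream = DataStream(
    train_set,
    iteration_scheme=ShuffledScheme(train_set.num_examples, batch_size=128))

for batch in stream.get_epoch_iterator():
    # batch is a tuple ordered like train_set.sources,
    # e.g. (features, targets); feed it to your compiled Theano function
    pass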
Or you might want to write your own Container class that overloads the
iterator protocol. The basic concept is that you store the data as packages
of chunks (roughly 10 GB per package), each in a single hdf5 file. The
container reads the first package and creates the shared variables for
train/valid/test which are used by your model. Whenever the first package is
fully processed, the container class loads the next package from file and
updates the shared variables under the hood. This results in the following
procedure:

# fetch the shared variables from the Container instance "packages"
train_x, train_y = packages.shareds()
# ... compile your train_model function with those shared variables,
#     i.e. givens={x: train_x, y: train_y}
# then train your model with
for p in packages:
    for batch in minibatches:
        train_model(batch)
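To make the givens part concrete, here is a minimal sketch of how the compile
step could look. The small softmax model, batch_size, n_in and n_out are only
placeholders, and I assume shareds() returns just the data and label shared
variables here (the class below actually returns four, including the length
vectors):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')                 # symbolic minibatch input
y = T.ivector('y')                # symbolic minibatch labels
index = T.lscalar('index')        # minibatch index
batch_size, n_in, n_out = 128, 100, 10

# a dummy softmax model, just to have a cost to minimize
W = theano.shared(np.zeros((n_in, n_out), dtype='float32'), name='W')
b = theano.shared(np.zeros(n_out, dtype='float32'), name='b')
p_y = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
params = [W, b]
updates = [(p, p - 0.01 * g) for p, g in zip(params, T.grad(cost, params))]

# shared variables owned by the container ("packages" is the Container instance)
train_x, train_y = packages.shareds()

train_model = theano.function(
    inputs=[index],
    outputs=cost,
    updates=updates,
    givens={x: train_x[index * batch_size:(index + 1) * batch_size],
            y: T.cast(train_y[index * batch_size:(index + 1) * batch_size],
                      'int32')})

# the container swaps the *contents* of train_x / train_y whenever it loads
# the next package, so the compiled function never has to be rebuilt
for p in packages:
    n_batches = p.get_shape('data')[0] // batch_size  # assumes axis 0 = examples
    for i in range(n_batches):
        train_model(i)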

What is nice about this concept is that you don't have to alter your
training code; the shared variables are updated under the hood. You can also
add transformer functions to this container class to apply feature
transforms etc. on the fly (see the sketch below). I added a code snippet
for such a container class that you might want to use.
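As an illustration of the transformer idea, a custom transform could look like
the hypothetical sketch below. Note that the class further down does NOT take
a transformers argument; its transform is hard-coded in __transform, so you
would have to add such a hook yourself. The (batch, time, features) layout is
also just an assumption.

import numpy as np

def add_deltas(data):
    # append first-order time differences as extra features
    deltas = data - np.roll(data, 1, axis=1)
    return np.concatenate([data, deltas], axis=-1)

# hypothetical constructor argument: each callable would be applied to the
# data of every package right before it is uploaded to the shared variables
container = Container(file_list, 'train', transformers=[add_deltas])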

hth. Mat

import sys
import os
import numpy as np
import random
import time
import glob
from scipy.io import loadmat

# load/save data
import cPickle
import pickle
#import deepdish as dd
import json
import copy
import scipy.io

import theano
  
# ------------------------------------------------------------------------------
# MULTIPLE DATA CONTAINER CLASS (CPU)
# ------------------------------------------------------------------------------
"""
     In this a container class for BIG data.
     It returns a loaded data package by the [] operator and
     a shared GPU variable when calling shared()
     
     The class can be used as a list like
     
         for p in MultiContainer:
             process p
     
     where p returns a data package (dict) which is loaded
     on demand. At the same time the shared() will be updated
     as well, that __container__ (CPU) corresponds to 
     __shared_container__ (GPU)
     
     The overall container size can be accessed via size(key)
     
"""
class Container(object):
        
    # --------------------------------------------------------------------------
    def __init__(self, packages, name, history=0, normalization='standard',
                 noise=False):

        self.__history = history
        self.__normalization = normalization
        self.__name = name                              
        self.__packages__ = packages               
        self.__shared_container__ = {}
        self.__idx__ = -1             
        self.__add_noise = noise
        self.__mean = None
        self.__variance = None
        self.n_packages = len(self.__packages__)
        
        # verbose output
        print('... using %s normalization for %s input'
              % (normalization, name))

        # verbose output for activated transformer
        if self.__history > 0:
            print('... activating data transformations for %s '
                  'using %i past frames' % (name, self.__history))

        # sanity check
        for p in self.__packages__:
            if not os.path.exists(p):
                print '  given package %s does not exist' % p
                sys.exit() 

    # --------------------------------------------------------------------------
    '''
        adds past N frames to feature dimension (last dim)
        Note: uses np. notation
    '''
    def __transform(self, data, noise=None):

        # add noise if enabled
        if self.__add_noise:
            print 'TODO: adding noise'

        # apply normalization
        if self.__normalization == 'standard':
            data = (data - self.__mean) / (self.__variance + 1e-9)
        elif self.__normalization == 'zca':
            data = np.dot(self.__zca, data)
        else:
            print ('... given normalization %s not implemented'
                   % self.__normalization)

        # apply transformation: append the past N frames so that each
        # time step also sees its history
        if self.__history > 0:
            d = copy.copy(data)
            for h in range(1, self.__history + 1):
                d = np.concatenate([d, np.roll(data, h, axis=1)], axis=1)
            return d

        return data
        
    # --------------------------------------------------------------------------
    def select(self, idx):
        
        if idx >= self.n_packages:
            print '... selected package index too high'
            sys.exit()
        self.__idx__  = idx
        self.__update_shared()
        
    # --------------------------------------------------------------------------
    def get_shape(self, key):

        self.__initialization_check()
        if key not in self.__shared_container__:
            print '  key %s does not exist in shared' % key
            sys.exit()
        return self.__shared_container__[key].get_value().shape

    # --------------------------------------------------------------------------
    def shareds(self):

        self.__initialization_check()
        return (self.__shared_container__['data'],
                self.__shared_container__['labels'],
                self.__shared_container__['data_length'],
                self.__shared_container__['labels_length'])
            
    # --------------------------------------------------------------------------
    def next(self):
        if self.__idx__ < self.n_packages - 1:
            self.__idx__ += 1
            self.__update_shared()
            return self
        else:
            self.__idx__ = -1
            raise StopIteration()
    
    # --------------------------------------------------------------------------
    def shuffle(self):

        self.__packages__ = list(np.random.permutation(self.__packages__))
    
    # --------------------------------------------------------------------------
    def name(self):
        return self.__name
                
    # --------------------------------------------------------------------------
    def __iter__(self):
        return self
    
    # --------------------------------------------------------------------------
    def __initialization_check(self):

        if not self.__shared_container__:
            print '... creating shared variables for %s' % self.__name
            # load a package once, just to initialise the shared variables
            package = self.__load(self.__packages__[self.__idx__])
            # noise
            noise = package['noise'] if self.__add_noise else None
            # normalization constants
            if self.__mean is None:
                self.__mean = package['mean']
            if self.__variance is None:
                self.__variance = package['variance']
            #self.__zca = package['zca']
            # set shared
            self.__create_shared('data', self.__transform(package['data'], noise))
            self.__create_shared('labels', package['labels'])
            self.__create_shared('data_length', package['data_length'], 'int32')
            self.__create_shared('labels_length', package['labels_length'], 'int32')
           
    # --------------------------------------------------------------------------
    def __update_shared(self):

        # create the shared variables if they do not exist yet
        self.__initialization_check()
        # load the current package from disk
        package = self.__load(self.__packages__[self.__idx__])
        # noise
        noise = package['noise'] if self.__add_noise else None
        # apply transformer on input (if enabled)
        data = self.__transform(package['data'], noise)
        labels = package['labels']
        data_length = package['data_length']
        labels_length = package['labels_length']
        # update the shared variables in place, casting to the dtypes
        # used when they were created
        self.__shared_container__['data'].set_value(np.asarray(data, 'float32'))
        self.__shared_container__['labels'].set_value(np.asarray(labels, 'float32'))
        self.__shared_container__['data_length'].set_value(np.asarray(data_length, 'int32'))
        self.__shared_container__['labels_length'].set_value(np.asarray(labels_length, 'int32'))
                    
    # --------------------------------------------------------------------------
    def __create_shared(self, key, value, dtype='float32'):

        self.__shared_container__[key] = theano.shared(
            np.asarray(value, dtype), borrow=True,
            name='shared_' + self.__name + '_' + key)
    
    # --------------------------------------------------------------------------
    def __load(self, file_path, type_='python'):

        obj = None
        try:
            if type_ == 'matlab':
                obj = scipy.io.loadmat(file_path)
            elif type_ == 'json':
                with open(file_path, 'r') as f:
                    obj = json.load(f)
            else:
                # pickled packages are written in binary mode
                with open(file_path, 'rb') as f:
                    obj = cPickle.load(f)
        except Exception as e:
            print '... error loading data from file %s (%s)' % (file_path, e)

        return obj
                
# ---------------------------------------------------------------------------------
if __name__ == '__main__':

    file_list = ['', '', '']                    # paths to your package files
    container = Container(file_list, 'train')   # Container requires a name



On Monday, September 19, 2016 at 11:30:42 UTC+2, Aditya Vora wrote:
>
> I am similarly dealing with a large number of video clips. I had stored the 
> clips in the LMDB format. I am loading the dataset as follows: 
> X_train, y_train, _ = load_data(some arguments)
> X_val, y_val, _ = load_data(some arguments)
> X_test, y_test = load_data(some arguments)
>
> data = (X_train, y_train, X_val, y_val, X_test, y_test)
>
> Here number of X_train clips is 912, X_val and X_test is 144.
> Is this the apt way to do it? My program shows a memory error when I try to 
> expand the dataset size. Can anyone suggest some way to iteratively load 
> the data one by one directly from the LMDB database. Some block of code 
> will be very helpful.
>
> On Thursday, September 15, 2016 at 8:57:53 PM UTC+5:30, Jose Carranza 
> wrote:
>>
>> Hi guys
>>
>> I have a fairly big dataset (million+ images for train) that I want to 
>> use to train from scratch a model in Theano. In Caffe we use LMDB however I 
>> haven't seen any best practice in Theano for something bigger than MNIST 
>> and stuff like that. Can somebody suggest what is the best option to pull 
>> data into Theano/Lasagne? I need something that is not 100% in memory but 
>> that can pull in batches (hopefully also shuffled batches).
>>
>> Thx in advance
>>
>
