You might have a look at Fuel: https://github.com/mila-udem/fuel
Or you could use a container class with iterator overloading. The basic
concept here is that you store packages of data chunks (e.g. 10 GB per
package) in single HDF5 files. The container reads the first package and
creates the shared variables train/valid/test that your model uses.
Whenever the first package has been fully processed, the container class
loads the next package from file and updates the shared variables under
the hood. This results in the following procedure:

# fetch shared variables from the Container class packages
train_x, train_y = packages.shareds()
# ... compile your train_model_func with those shared variables, i.e.
#     givens={x: train_x, y: train_y}
# train your model with
for p in packages:
    for batch in mini_batches:
        train_model_func(batch)
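Stripped of Theano, the double-buffering idea behind that loop can be sketched in plain Python. Everything here (`PackageIterator`, the list-of-lists "packages") is a hypothetical stand-in: `np.asarray` plays the role of loading an HDF5 package and refreshing the shared variable in place.

```python
import numpy as np


class PackageIterator(object):
    """Cycles through data packages, reloading a single buffer in place."""

    def __init__(self, packages):
        self.packages = packages  # list of loadable package sources
        self.idx = -1
        self.buffer = None        # stands in for the GPU shared variable

    def __iter__(self):
        return self

    def __next__(self):
        self.idx += 1
        if self.idx >= len(self.packages):
            self.idx = -1         # reset so the iterator is reusable
            raise StopIteration
        # "load" the next package and update the buffer in place
        self.buffer = np.asarray(self.packages[self.idx])
        return self.buffer

    next = __next__  # Python 2 compatibility


loader = PackageIterator([[1, 2], [3, 4], [5, 6]])
totals = [int(chunk.sum()) for chunk in loader]  # one pass over all packages
print(totals)
```

The training loop only ever sees the current buffer, so its code never changes; only what the buffer points at does.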

What is nice about this concept is that you don't have to alter your
training code: the shared variables are updated under the hood. You can
also add transformer functions to the container class, applying feature
transforms etc. on the fly...
I added a code snippet for such a container class that you might want to use.
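One such on-the-fly transform is the "history" feature used in the snippet below: stacking the previous N frames onto the feature dimension. A minimal NumPy sketch, assuming data laid out as (batch, time, features) — the axis convention is my assumption, not taken from the original code:

```python
import numpy as np


def add_history(data, history):
    """Append the previous `history` frames along the feature axis.

    data: array of shape (batch, time, features). Rolling along the
    time axis shifts each frame's predecessor into view; the shifted
    copies are stacked onto the feature (last) dimension.
    """
    out = data
    for h in range(1, history + 1):
        out = np.concatenate([out, np.roll(data, h, axis=1)], axis=2)
    return out


x = np.arange(2 * 4 * 3).reshape(2, 4, 3).astype('float32')
y = add_history(x, history=2)
print(y.shape)  # (2, 4, 9): original features plus two shifted copies
```

Note that `np.roll` wraps around, so the first frame sees the last frame as its "past"; depending on the task you may prefer zero-padding instead.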

hth. Mat

import os
import copy
import json

import numpy as np
import scipy.io

import theano

# load/save data
try:
    import cPickle            # Python 2
except ImportError:
    import pickle as cPickle  # Python 3
class Container(object):
    """
    A container class for BIG data.

    It returns a loaded data package via the [] operator and
    shared GPU variables when calling shareds().
    The class can be used like a list:

        for p in container:
            process(p)

    where p is a data package (dict) which is loaded on demand.
    At the same time shareds() is updated as well, so that
    __container__ (CPU) corresponds to __shared_container__ (GPU).
    The overall container size can be accessed via size(key).
    """
    def __init__(self, packages, name, history=0, normalization='standard',
                 noise=False):
        self.__history = history
        self.__normalization = normalization
        self.__name = name
        self.__packages__ = packages
        self.__shared_container__ = {}
        self.__idx__ = -1
        self.__add_noise = noise
        self.__mean = None
        self.__variance = None
        self.n_packages = len(packages)
        # verbose output
        print('... using %s normalization for %s input' % (normalization, name))
        # verbose output for activated transformer
        if self.__history > 0:
            print('... activating data transformations for %s using %i past '
                  'frames' % (name, self.__history))

        # sanity check
        for p in self.__packages__:
            if not os.path.exists(p):
                print('  given package %s does not exist' % p)

    def __transform(self, data, noise=None):
        """
        Adds the past N frames to the feature dimension (last dim).
        Note: uses np. notation.
        """
        # add noise if enabled
        if self.__add_noise:
            print('TODO: adding noise')
        # apply normalization
        if self.__normalization == 'standard':
            data = (data - self.__mean) / (self.__variance + 1e-9)
        elif self.__normalization == 'zca':
            data = np.dot(self.__zca, data)
        else:
            print('... given normalization %s not implemented'
                  % self.__normalization)
        # apply transformation: stack the past `history` frames onto the
        # feature (last) dimension, rolling along the time axis
        if self.__history > 0:
            d = copy.copy(data)
            for h in range(1, self.__history + 1):
                d = np.concatenate([d, np.roll(data, h, axis=1)], axis=-1)
            return d
        return data
    def select(self, idx):
        if idx >= self.n_packages:
            raise IndexError('... selected package index %i too high' % idx)
        self.__idx__ = idx
    def get_shape(self, key):
        if key not in self.__shared_container__:
            raise KeyError('  key %s does not exist in shared' % key)
        return self.__shared_container__[key].get_value().shape

    def shareds(self):
        # lazily create the shared variables on first access
        self.__initialization_check()
        return (self.__shared_container__['data'],
                self.__shared_container__['labels'])
    def next(self):
        if self.__idx__ < self.n_packages - 1:
            self.__idx__ += 1
            # load the next package and refresh the shared variables
            self.__update_shared()
            return self
        else:
            self.__idx__ = -1
            raise StopIteration()

    __next__ = next  # Python 3 iterator protocol
    def shuffle(self):
        self.__packages__ = np.random.permutation(self.__packages__)

    def name(self):
        return self.__name
    def __iter__(self):
        return self
    def __initialization_check(self):
        if not self.__shared_container__:
            print('... creating shared variables for %s' % self.__name)
            # load package
            package = self.__load(self.__packages__[self.__idx__])
            # noise
            noise = package['noise'] if self.__add_noise else None
            # normalization constants (assumed to be stored in the package)
            if self.__mean is None:
                self.__mean = package['mean']
            if self.__variance is None:
                self.__variance = package['variance']
            #self.__zca = package['zca']
            # set shared
            self.__create_shared('data', self.__transform(package['data'], noise))
            self.__create_shared('labels', package['labels'])
            self.__create_shared('data_length', package['data_length'], 'int32')
            self.__create_shared('labels_length', package['labels_length'], 'int32')
    def __update_shared(self):
        # create shared variables if they do not exist yet
        self.__initialization_check()
        # update package
        package = self.__load(self.__packages__[self.__idx__])
        # noise
        noise = package['noise'] if self.__add_noise else None
        # apply transformer on input (if enabled)
        data = self.__transform(package['data'], noise)
        labels = package['labels']
        data_length = np.asarray(package['data_length'], 'int32')
        labels_length = np.asarray(package['labels_length'], 'int32')
        # set shared variables in place; the compiled function keeps
        # pointing at the same shared variables
        self.__shared_container__['data'].set_value(
            np.asarray(data, 'float32'), borrow=True)
        self.__shared_container__['labels'].set_value(
            np.asarray(labels, 'float32'), borrow=True)
        self.__shared_container__['data_length'].set_value(
            data_length, borrow=True)
        self.__shared_container__['labels_length'].set_value(
            labels_length, borrow=True)
    def __create_shared(self, key, value, type='float32'):
        self.__shared_container__[key] = theano.shared(
            np.asarray(value, type), borrow=True,
            name='shared_' + self.__name + '_' + key)
    def __load(self, file_path, type_='python'):
        obj = None
        try:
            if type_ == 'matlab':
                obj = scipy.io.loadmat(file_path)
            elif type_ == 'json':
                with open(file_path, 'r') as f:
                    obj = json.load(f)
            else:
                with open(file_path, 'rb') as f:
                    obj = cPickle.load(f)
        except IOError:
            print('... error loading data from file %s' % file_path)
        return obj
if __name__ == '__main__':
    file_list = ['', '', '']
    container = Container(file_list, 'train')
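For completeness, here is one way the package files themselves might be produced. This is a sketch, not the original author's code: `write_packages` and its key names are assumptions, and it writes pickle files instead of HDF5 purely to stay dependency-free.

```python
import os
import pickle
import tempfile

import numpy as np


def write_packages(data, labels, n_packages, out_dir):
    """Split one big dataset into per-package pickle files.

    The 'data'/'labels' keys match what a package-loading container
    would read back; a real package would also carry e.g.
    'data_length', 'labels_length', 'mean' and 'variance'.
    """
    paths = []
    for i, (d, l) in enumerate(zip(np.array_split(data, n_packages),
                                   np.array_split(labels, n_packages))):
        path = os.path.join(out_dir, 'package_%02d.pkl' % i)
        with open(path, 'wb') as f:
            pickle.dump({'data': d, 'labels': l}, f)
        paths.append(path)
    return paths


tmp = tempfile.mkdtemp()
data = np.random.randn(100, 5).astype('float32')
labels = np.arange(100)
paths = write_packages(data, labels, n_packages=4, out_dir=tmp)
# each file now holds one 25-sample package ready to be loaded on demand
print(len(paths))
```

With HDF5 you would write one dataset per key instead of one dict per file, which additionally allows partial (sliced) reads within a package.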

Am Montag, 19. September 2016 11:30:42 UTC+2 schrieb Aditya Vora:
> I am similarly dealing with a large number of video clips. I had stored the 
> clips in the LMDB format. I am loading the dataset as follows: 
> X_train, y_train, _ = load_data(some arguments)
> X_val, y_val, _ = load_data(some arguments)
> X_test, y_test = load_data(some arguments)
> data = (X_train, y_train, X_val, y_val, X_test, y_test)
> Here the number of X_train clips is 912, and X_val and X_test are 144 each.
> Is this the apt way to do it? My program shows a memory error when I try to 
> expand the dataset size. Can anyone suggest some way to iteratively load 
> the data one by one directly from the LMDB database? Some block of code 
> would be very helpful.
> On Thursday, September 15, 2016 at 8:57:53 PM UTC+5:30, Jose Carranza 
> wrote:
>> Hi guys
>> I have a fairly big dataset (million+ images for train) that I want to 
>> use to train from scratch a model in Theano. In Caffe we use LMDB however I 
>> haven't seen any best practice in Theano for something bigger than MNIST 
>> and stuff like that. Can somebody suggest what is the best option to pull 
>> data into Theano/Lasagne? I need something that is not 100% in memory but 
>> that can pull in batches (hopefully also shuffled batches).
>> Thx in advance

