Hi all,
I am currently working with a large dataset (tens of thousands of 65x65
images) that is too large to fit in memory, so I want to load it in
batches of twenty. For example, instead of loading all 60,000 MNIST
images at once, I want to load the data twenty images at a time. My data
is already divided into numpy arrays of this size, and during each
iteration I want to load a new batch into memory and pass it to
theano.function().
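Concretely, each batch I want to hand to theano.function() would look roughly
like the following (random arrays used as stand-ins, and the ten-class labels
are only an example):

import numpy

batch_size = 20
images = numpy.random.rand(batch_size, 65, 65)           # stand-in for one batch of 65x65 images
batch_x = images.reshape(batch_size, 65 * 65)            # flatten each image into a row: a (20, 4225) matrix
batch_y = numpy.random.randint(0, 10, size=batch_size)   # one integer label per image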
Starting from the basic MLP tutorial example, I made the following minor edits to the code:
1. I load the data as plain numpy arrays rather than as shared variables (I am
not concerned with using the GPU for computation).
2. I still declare 'x' and 'y' as symbolic variables:

x = T.matrix('x')
y = T.ivector('y')

3. I modified the theano.function() calls to take 'x' and 'y' directly as
inputs, instead of a T.lscalar() index variable used with givens (a
self-contained sketch of all three definitions follows point 4 below):
test_model = theano.function(inputs=[x, y], outputs=classifier.errors(y),
                             allow_input_downcast=True)

validate_model = theano.function(inputs=[x, y], outputs=classifier.errors(y),
                                 allow_input_downcast=True)

train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates,
                              allow_input_downcast=True)

4. I then try to load each new batch at the start of the loop (for this
initial phase, I am not worried about resetting the counter). However, my
model does not seem to be updating across iterations: the validation and test
scores at each iteration appear to be independent of the previous round. My
code edits are below:
counter = 0

while (epoch < n_epochs) and (not done_looping):
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        datasets = load_data(dataset, counter)
        train_set_x, train_set_y = datasets[0]
        valid_set_x, valid_set_y = datasets[1]
        test_set_x, test_set_y = datasets[2]
        counter += 1

        minibatch_avg_cost = train_model(train_set_x, train_set_y)
        # iteration number
        iter = (epoch - 1) * n_train_batches + minibatch_index

        if (iter + 1) % validation_frequency == 0:
            # compute zero-one loss on validation set
            validation_losses = validate_model(valid_set_x, valid_set_y)
            this_validation_loss = numpy.mean(validation_losses)

            print(
                'epoch %i, minibatch %i/%i, validation error %f %%' %
                (
                    epoch,
                    minibatch_index + 1,
                    n_train_batches,
                    this_validation_loss * 100.
                )
            )

            # if we got the best validation score until now
            if this_validation_loss < best_validation_loss:
                # improve patience if loss improvement is good enough
                if (
                    this_validation_loss < best_validation_loss *
                    improvement_threshold
                ):
                    patience = max(patience, iter * patience_increase)

                best_validation_loss = this_validation_loss
                best_iter = iter

                # test it on the test set
                test_losses = test_model(test_set_x, test_set_y)
                test_score = numpy.mean(test_losses)

                print((' epoch %i, minibatch %i/%i, test error of '
                       'best model %f %%') %
                      (epoch, minibatch_index + 1, n_train_batches,
                       test_score * 100.))

        if patience <= iter:
            done_looping = True
            break

end_time = timeit.default_timer()
print(('Optimization complete. Best validation score of %f %% '
       'obtained at iteration %i, with test performance %f %%') %
      (best_validation_loss * 100., best_iter + 1, test_score * 100.))
print(('The code for file ' +
       os.path.split(__file__)[1] +
       ' ran for %.2fm' % ((end_time - start_time) / 60.)),
      file=sys.stderr)
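To make points 1-3 concrete, here is a stripped-down, self-contained sketch of
the setup I have in mind, with a plain logistic-regression classifier standing
in for my MLP (the class count, learning rate, and random stand-in batch are
placeholders):

import numpy
import theano
import theano.tensor as T

# placeholder classifier: logistic regression standing in for the MLP
n_in, n_out = 65 * 65, 10          # 65x65 inputs; ten classes is just an example
x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros((n_out,), dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])   # negative log-likelihood
errors = T.mean(T.neq(T.argmax(p_y_given_x, axis=1), y))      # zero-one loss

learning_rate = 0.13               # placeholder value
g_W, g_b = T.grad(cost, W), T.grad(cost, b)
updates = [(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)]

# the three functions from point 3: numpy batches are passed in directly, no index/givens
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates,
                              allow_input_downcast=True)
validate_model = theano.function(inputs=[x, y], outputs=errors,
                                 allow_input_downcast=True)
test_model = theano.function(inputs=[x, y], outputs=errors,
                             allow_input_downcast=True)

# one training step on a random stand-in batch of twenty images
batch_x = numpy.random.rand(20, n_in)
batch_y = numpy.random.randint(0, n_out, size=20)
minibatch_avg_cost = train_model(batch_x, batch_y)

In the loop from point 4, the only difference is that the random arrays are
replaced by the batches returned by load_data(dataset, counter).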
Could someone show me the best way to do this? All I am trying to do is move
from loading the entire dataset into memory as a shared variable to loading
each batch into memory separately.
- Geoffrey