Hi all,
I am currently working with a large dataset (tens of thousands of 65x65
images) that is too large to fit in memory, so I want to load it in
batches of twenty. For example, instead of loading all 60,000 MNIST
images at once, I want to load the data twenty images at a time. My data
is already divided into numpy arrays of this size, and during each
iteration I want to load a new batch into memory and pass it to
theano.function().
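Concretely, each batch I want to hand to theano.function() would look roughly
like the following (random arrays used as stand-ins, and the ten-class labels
are only an example):

import numpy

batch_size = 20
images = numpy.random.rand(batch_size, 65, 65)           # stand-in for one batch of 65x65 images
batch_x = images.reshape(batch_size, 65 * 65)            # flatten each image into a row: a (20, 4225) matrix
batch_y = numpy.random.randint(0, 10, size=batch_size)   # one integer label per image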
Starting from the basic MLP tutorial example, I made the following minor edits to the code:
1. I load the data as plain numpy arrays rather than as shared variables (I am
not concerned with using the GPU for computation).
2. I still declare 'x' and 'y' as symbolic variables:

x = T.matrix('x')
y = T.ivector('y')

3. I modified the theano.function() calls to take 'x' and 'y' directly as
inputs, instead of a T.lscalar() index variable used with givens (a
self-contained sketch of all three definitions follows point 4 below):
test_model = theano.function(inputs=[x, y], outputs=classifier.errors(y),
                             allow_input_downcast=True)

validate_model = theano.function(inputs=[x, y], outputs=classifier.errors(y),
                                 allow_input_downcast=True)

train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates,
                              allow_input_downcast=True)

4. I then try to load each new batch at the start of the loop (for this
initial phase, I am not worried about resetting the counter). However, my
model does not seem to be updating across iterations: the validation and test
scores at each iteration appear to be independent of the previous round. My
code edits are below:
counter = 0

while (epoch < n_epochs) and (not done_looping):
    epoch = epoch + 1
    for minibatch_index in range(n_train_batches):

        datasets = load_data(dataset, counter)
        train_set_x, train_set_y = datasets[0]
        valid_set_x, valid_set_y = datasets[1]
        test_set_x, test_set_y = datasets[2]
        counter += 1

        minibatch_avg_cost = train_model(train_set_x, train_set_y)
        # iteration number
        iter = (epoch - 1) * n_train_batches + minibatch_index

        if (iter + 1) % validation_frequency == 0:
            # compute zero-one loss on validation set
            validation_losses = validate_model(valid_set_x, valid_set_y)
            this_validation_loss = numpy.mean(validation_losses)

            print(
                'epoch %i, minibatch %i/%i, validation error %f %%' %
                (
                    epoch,
                    minibatch_index + 1,
                    n_train_batches,
                    this_validation_loss * 100.
                )
            )

            # if we got the best validation score until now
            if this_validation_loss < best_validation_loss:
                # improve patience if loss improvement is good enough
                if (
                    this_validation_loss < best_validation_loss *
                    improvement_threshold
                ):
                    patience = max(patience, iter * patience_increase)

                best_validation_loss = this_validation_loss
                best_iter = iter

                # test it on the test set
                test_losses = test_model(test_set_x, test_set_y)
                test_score = numpy.mean(test_losses)

                print((' epoch %i, minibatch %i/%i, test error of '
                       'best model %f %%') %
                      (epoch, minibatch_index + 1, n_train_batches,
                       test_score * 100.))

        if patience <= iter:
            done_looping = True
            break

end_time = timeit.default_timer()
print(('Optimization complete. Best validation score of %f %% '
       'obtained at iteration %i, with test performance %f %%') %
      (best_validation_loss * 100., best_iter + 1, test_score * 100.))
print(('The code for file ' +
       os.path.split(__file__)[1] +
       ' ran for %.2fm' % ((end_time - start_time) / 60.)),
      file=sys.stderr)
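To make points 1-3 concrete, here is a stripped-down, self-contained sketch of
the setup I have in mind, with a plain logistic-regression classifier standing
in for my MLP (the class count, learning rate, and random stand-in batch are
placeholders):

import numpy
import theano
import theano.tensor as T

# placeholder classifier: logistic regression standing in for the MLP
n_in, n_out = 65 * 65, 10          # 65x65 inputs; ten classes is just an example
x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(numpy.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros((n_out,), dtype=theano.config.floatX), name='b')

p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)
cost = -T.mean(T.log(p_y_given_x)[T.arange(y.shape[0]), y])   # negative log-likelihood
errors = T.mean(T.neq(T.argmax(p_y_given_x, axis=1), y))      # zero-one loss

learning_rate = 0.13               # placeholder value
g_W, g_b = T.grad(cost, W), T.grad(cost, b)
updates = [(W, W - learning_rate * g_W), (b, b - learning_rate * g_b)]

# the three functions from point 3: numpy batches are passed in directly, no index/givens
train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates,
                              allow_input_downcast=True)
validate_model = theano.function(inputs=[x, y], outputs=errors,
                                 allow_input_downcast=True)
test_model = theano.function(inputs=[x, y], outputs=errors,
                             allow_input_downcast=True)

# one training step on a random stand-in batch of twenty images
batch_x = numpy.random.rand(20, n_in)
batch_y = numpy.random.randint(0, n_out, size=20)
minibatch_avg_cost = train_model(batch_x, batch_y)

In the loop from point 4, the only difference is that the random arrays are
replaced by the batches returned by load_data(dataset, counter).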
Could someone show me the best way to do this? All I am trying to do is move
from loading the entire dataset into memory as a shared variable to loading
each batch into memory separately.
- Geoffrey