dimon777 opened a new issue #9912: No training happening when CSVIter is used.
URL: https://github.com/apache/incubator-mxnet/issues/9912

## Description

It appears to me that either CSVIter is broken, or something else in MXNet makes it impossible to train a model fed by CSVIter. I have a reproducible example using the CSV MNIST dataset (from https://pjreddie.com/projects/mnist-in-csv/).

## Environment info (Required)

```
What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.

$ python3 diagnose.py
----------Python Info----------
Version      : 3.6.4
Compiler     : GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)
Build        : ('default', 'Feb 18 2018 11:42:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /usr/local/homebrew/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.1.0
Directory    : /usr/local/homebrew/lib/python3.6/site-packages/mxnet
Commit Hash  : 07a83a0325a3d782513a04f47d711710972cb144
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : MAC-DBuzolin
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Thu Jan 11 22:59:40 PST 2018; root:xnu-3789.73.8~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID FPU_CSDS'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0266 sec, LOAD: 0.5727 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0413 sec, LOAD: 0.1074 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1497 sec, LOAD: 0.9886 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0516 sec, LOAD: 0.8435 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0285 sec, LOAD: 0.1718 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0313 sec, LOAD: 0.1717 sec.
```

Package used (Python/R/Scala/Julia): Python

## Error Message:

No error message, but training "converges" to nan.

## Minimum reproducible example

```
from __future__ import print_function
import numpy as np
import mxnet as mx
from mxnet import nd, autograd, gluon
import matplotlib.pyplot as plt
from numpy import genfromtxt

mx.random.seed(1)
data_ctx = mx.cpu()
model_ctx = mx.cpu()

num_inputs = 784
data_shape = (num_inputs,)
label_shape = (1,)
num_outputs = 10
batch_size = 32

train_data = mx.io.CSVIter(data_csv="./data/mnist/mnist_iter_train_data.csv",
                           data_shape=data_shape,
                           label_csv="./data/mnist/mnist_iter_train_label.csv",
                           label_shape=label_shape,
                           batch_size=batch_size,
                           round_batch=False)
test_data = mx.io.CSVIter(data_csv="./data/mnist/mnist_iter_test_data.csv",
                          data_shape=data_shape,
                          label_csv="./data/mnist/mnist_iter_test_label.csv",
                          label_shape=label_shape,
                          batch_size=batch_size,
                          round_batch=False)

net = gluon.nn.Dense(num_outputs)
net.collect_params().initialize(mx.init.Normal(sigma=.1), ctx=model_ctx)
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

def evaluate_accuracy(data_iterator, net):
    acc = mx.metric.Accuracy()
    for i, batch in enumerate(data_iterator):
        data = batch.data[0].as_in_context(model_ctx)/255  # .reshape((-1,num_inputs))
        label = batch.label[0].as_in_context(model_ctx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()[1]

epochs = 10
moving_loss = 0.
num_examples = 60000
loss_sequence = []
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

for e in range(epochs):
    cumulative_loss = 0
    for i, batch in enumerate(train_data):
        data = batch.data[0].as_in_context(model_ctx)/255  # .reshape((-1,num_inputs))
        label = batch.label[0].as_in_context(model_ctx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(batch_size)
        cumulative_loss += nd.sum(loss).asscalar()
    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, round(cumulative_loss/num_examples, 6),
           round(train_accuracy, 4), round(test_accuracy, 4)))
    loss_sequence.append(cumulative_loss)
```

## Steps to reproduce

Run the script above to see this output:

```
Epoch 0. Loss: 0.419388, Train_acc nan, Test_acc nan
Epoch 1. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 2. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 3. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 4. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 5. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 6. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 7. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 8. Loss: 0.0, Train_acc nan, Test_acc nan
Epoch 9. Loss: 0.0, Train_acc nan, Test_acc nan
```

## What have you tried to solve it?

A similar example using NDArrays, with small adjustments to the training loop and the data-loading functions, works as expected; a sketch of both the data preparation and the NDArray variant follows. I can provide the data I used, but it is simply derived from the link above by extracting the labels from the train/test files into separate label files and removing the label column from the data files.
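For reference, a minimal sketch of that data preparation, assuming the `mnist_train.csv`/`mnist_test.csv` files from the linked page; the helper name `split_csv` is hypothetical, not code from the report:

```
import csv

def split_csv(src, data_out, label_out):
    # Each row of the mnist-in-csv files is: label, px1, ..., px784.
    # Write the first column to the label CSV and the rest to the data CSV.
    with open(src) as fin, \
         open(data_out, "w", newline="") as fd, \
         open(label_out, "w", newline="") as fl:
        data_w, label_w = csv.writer(fd), csv.writer(fl)
        for row in csv.reader(fin):
            label_w.writerow(row[:1])
            data_w.writerow(row[1:])

split_csv("mnist_train.csv",
          "./data/mnist/mnist_iter_train_data.csv",
          "./data/mnist/mnist_iter_train_label.csv")
split_csv("mnist_test.csv",
          "./data/mnist/mnist_iter_test_data.csv",
          "./data/mnist/mnist_iter_test_label.csv")
```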
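And a sketch of the NDArray-based comparison mentioned above. This is an assumption about what the working variant looks like, not the author's exact code; it loads the same CSVs with `numpy.genfromtxt` (already imported in the repro script) and feeds them through `mx.io.NDArrayIter`:

```
# Load the split CSVs into numpy arrays and wrap them in NDArrayIter.
train_X = genfromtxt("./data/mnist/mnist_iter_train_data.csv", delimiter=",")
train_y = genfromtxt("./data/mnist/mnist_iter_train_label.csv", delimiter=",")
train_data = mx.io.NDArrayIter(train_X, train_y, batch_size, shuffle=True)

for e in range(epochs):
    # Presumably one of the "small adjustments": mx.io iterators are not
    # reset automatically, so reset before each pass over the data.
    train_data.reset()
    for i, batch in enumerate(train_data):
        # ... same training step as in the repro script above ...
        pass
```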
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services