dimon777 opened a new issue #9912: No training happening when CSVIter is used.
URL: https://github.com/apache/incubator-mxnet/issues/9912
 
 
   ## Description
   It appears that CSVIter, or something else in MXNet, is broken in a way that 
makes it impossible to train a model fed by CSVIter. I have a reproducible 
example using the MNIST dataset in CSV format (from 
https://pjreddie.com/projects/mnist-in-csv/).
   
   ## Environment info (Required)
   
   ```
   $ python3 diagnose.py 
   ----------Python Info----------
   Version      : 3.6.4
   Compiler     : GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)
   Build        : ('default', 'Feb 18 2018 11:42:51')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 9.0.1
   Directory    : /usr/local/homebrew/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.1.0
   Directory    : /usr/local/homebrew/lib/python3.6/site-packages/mxnet
   Commit Hash   : 07a83a0325a3d782513a04f47d711710972cb144
   ----------System Info----------
   Platform     : Darwin-16.7.0-x86_64-i386-64bit
   system       : Darwin
   node         : MAC-DBuzolin
   release      : 16.7.0
   version      : Darwin Kernel Version 16.7.0: Thu Jan 11 22:59:40 PST 2018; root:xnu-3789.73.8~1/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT RDTSCP TSCI'
   b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID FPU_CSDS'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz'
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0266 sec, LOAD: 0.5727 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0413 sec, LOAD: 0.1074 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1497 sec, LOAD: 0.9886 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0516 sec, LOAD: 0.8435 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0285 sec, LOAD: 0.1718 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0313 sec, LOAD: 0.1717 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   Python
   
   ## Error Message:
   No error message, but training does not progress: the loss drops to 0.0 
after the first epoch and both accuracies are reported as "nan".
   
   ## Minimum reproducible example
   ```
   from __future__ import print_function
   import numpy as np
   import mxnet as mx
   from mxnet import nd, autograd, gluon
   import matplotlib.pyplot as plt
   from numpy import genfromtxt
   mx.random.seed(1)
   data_ctx = mx.cpu()
   model_ctx = mx.cpu()
   num_inputs=784
   data_shape = (num_inputs,)
   label_shape=(1,)
   num_outputs = 10
   batch_size = 32
   train_data = mx.io.CSVIter(data_csv="./data/mnist/mnist_iter_train_data.csv", data_shape=data_shape,
                              label_csv="./data/mnist/mnist_iter_train_label.csv", label_shape=label_shape,
                              batch_size=batch_size, round_batch = False)
   test_data = mx.io.CSVIter(data_csv="./data/mnist/mnist_iter_test_data.csv", data_shape=data_shape,
                             label_csv="./data/mnist/mnist_iter_test_label.csv", label_shape=label_shape,
                             batch_size=batch_size, round_batch = False)
   net = gluon.nn.Dense(num_outputs)
   net.collect_params().initialize(mx.init.Normal(sigma=.1), ctx=model_ctx)
   softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
   def evaluate_accuracy(data_iterator, net):
       acc = mx.metric.Accuracy()
       for i, batch in enumerate(data_iterator):
           data = batch.data[0].as_in_context(model_ctx)/255  #.reshape((-1,num_inputs))
           label = batch.label[0].as_in_context(model_ctx)
           output = net(data)
           predictions = nd.argmax(output, axis=1)
           acc.update(preds=predictions, labels=label)
       return acc.get()[1]
   epochs = 10
   moving_loss = 0.
   num_examples = 60000
   loss_sequence = []
   trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
   for e in range(epochs):
       cumulative_loss = 0
       for i, batch in enumerate(train_data):
           data = batch.data[0].as_in_context(model_ctx)/255  #.reshape((-1,num_inputs))
           label = batch.label[0].as_in_context(model_ctx)
           with autograd.record():
               output = net(data)
               loss = softmax_cross_entropy(output, label)
           loss.backward()
           trainer.step(batch_size)
           cumulative_loss += nd.sum(loss).asscalar()
   
       test_accuracy = evaluate_accuracy(test_data, net)
       train_accuracy = evaluate_accuracy(train_data, net)
       print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (
           e, round(cumulative_loss/num_examples,6), round(train_accuracy,4), round(test_accuracy,4)))
       loss_sequence.append(cumulative_loss)
   ```
   
   
   ## Steps to reproduce
   Run the script above to see this output:
   
   ```
   Epoch 0. Loss: 0.419388, Train_acc nan, Test_acc nan
   Epoch 1. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 2. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 3. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 4. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 5. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 6. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 7. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 8. Loss: 0.0, Train_acc nan, Test_acc nan
   Epoch 9. Loss: 0.0, Train_acc nan, Test_acc nan
   ```
   
   ## What have you tried to solve it?
   A similar example that feeds NDArrays, with small adjustments to the 
training loop and data-loading functions, works as expected.
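
   One difference from the NDArray variant that may be worth ruling out: as far 
as I know, `mx.io.DataIter` objects such as `CSVIter` are single-pass and 
provide a `reset()` method to rewind them, and the script above never calls it. 
The sketch below is not MXNet code; `OneShotIter` and `batches_per_epoch` are 
hypothetical stand-ins written only to show how a single-pass iterator behaves 
across epochs with and without a reset.

```python
class OneShotIter:
    """Hypothetical stand-in for a single-pass data iterator (not MXNet code)."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._cursor >= len(self._batches):
            raise StopIteration  # stays exhausted until reset() is called
        batch = self._batches[self._cursor]
        self._cursor += 1
        return batch

    def reset(self):
        """Rewind to the first batch."""
        self._cursor = 0


def batches_per_epoch(data_iter, epochs, reset_each_epoch):
    """Count how many batches each epoch's for-loop actually receives."""
    counts = []
    for _ in range(epochs):
        if reset_each_epoch:
            data_iter.reset()
        counts.append(sum(1 for _ in data_iter))
    return counts


without_reset = batches_per_epoch(OneShotIter(range(5)), epochs=3, reset_each_epoch=False)
with_reset = batches_per_epoch(OneShotIter(range(5)), epochs=3, reset_each_epoch=True)
print(without_reset)  # [5, 0, 0]
print(with_reset)     # [5, 5, 5]
```

   If that assumption holds for `CSVIter`, the counterpart in the script would 
be calling `train_data.reset()` at the start of each epoch and resetting each 
iterator before it is passed to `evaluate_accuracy`. An exhausted training 
iterator would be consistent with the 0.0 loss from epoch 1 onward and the nan 
train accuracy, though it does not by itself explain the nan test accuracy in 
epoch 0.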
   
   I can provide the data I used, but it is simply derived from the link above 
by extracting the labels from the train/test files into separate label files 
and removing the labels from the data files.
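
   The split described above can be sketched roughly as follows. This is a 
minimal sketch, not the exact commands used: it assumes the label is the first 
column of every row and that the files have no header line, and the function 
name `split_labels` and the input file name in the usage note are hypothetical.

```python
import csv


def split_labels(src_path, data_path, label_path, label_col=0):
    """Write column `label_col` of each row to label_path and the rest to data_path.

    Assumes a header-less CSV; returns the number of rows processed.
    """
    rows = 0
    with open(src_path, newline="") as src, \
            open(data_path, "w", newline="") as data_out, \
            open(label_path, "w", newline="") as label_out:
        data_writer = csv.writer(data_out, lineterminator="\n")
        label_writer = csv.writer(label_out, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            label_writer.writerow([row[label_col]])
            data_writer.writerow(row[:label_col] + row[label_col + 1:])
            rows += 1
    return rows
```

   For the training set this would be called as, for example, 
`split_labels("mnist_train.csv", "mnist_iter_train_data.csv", "mnist_iter_train_label.csv")`, 
and likewise for the test set, producing the file names the script above reads.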

