Attempting to train a model using DDP/distributed training (the `gloo` backend), but an error is thrown at the very first call to `loss.backward()`:

```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: […] is at version 6; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later.
```

On enabling `torch.autograd.set_detect_anomaly(True)`, further information is provided:

```
/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/autograd/__init__.py:156: UserWarning: Error detected in NativeBatchNormBackward0. Traceback of forward call that caused the error:
  File "train_depth_concat_cpu.py", line 469, in
  File "/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
  File "/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 888, in forward
    v3plus_feature = self.head(blocks, depth)  # (b, c, h, w)
  File "/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
  File "/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
  File "/home/extraspace/anaconda3/envs/semiseg-test/lib/python3.6/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps,
 (Triggered internally at torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
```

For the code snippet causing the error:

```python
f4 = self.last_conv(f3)
```

I attempted `f4 = self.last_conv(f3.clone())` and `f4 = self.last_conv(f3).clone()`, but to no avail.

I've seen another thread where the use of SyncBatchNorm may solve the error, but this is currently only supported on GPU distributed training, so it isn't really an option (a rough sketch of that conversion is included at the end of this post).

I'm also using a standard ResNet implementation as my backbone and made sure `inplace` is set to False on all blocks/layers. I don't think this would be the problem, though, since the code snippet with the error shows the very first operation (immediately prior to the classifier) causing the error, so autograd doesn't even reach the backbone while propagating grads…

Sure, please find below (apologies for the excess comments; it's a work in progress):

```python
def conv3x3(in_planes, out_planes, stride=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=1, bias=False)


class BasicBlock(nn.Module):
    def __init__(self, inplanes, planes, stride=1, norm_layer=None,
                 bn_eps=1e-5, bn_momentum=0.1, downsample=None, inplace=True):
        ...
        self.bn1 = norm_layer(planes, eps=bn_eps, momentum=bn_momentum)
        self.relu_inplace = nn.ReLU(inplace=True)
        ...
        self.bn2 = norm_layer(planes, eps=bn_eps, momentum=bn_momentum)


class Bottleneck(nn.Module):
    def __init__(self, inplanes, planes, stride=1, norm_layer=None,
                 bn_eps=1e-5, bn_momentum=0.1, downsample=None, inplace=True):
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False)
        ...
```
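For reference, here is a minimal sketch of how a block like the one above can be written with every operation kept out-of-place (no `inplace=True` ReLU, no `out += identity`). The class name, `norm_layer` default, and overall layout follow the standard torchvision-style ResNet and are assumptions, not the exact work-in-progress code:

```python
import torch.nn as nn


class BasicBlock(nn.Module):
    """Standard ResNet basic block with all in-place operations disabled (sketch)."""
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, norm_layer=nn.BatchNorm2d,
                 bn_eps=1e-5, bn_momentum=0.1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn1 = norm_layer(planes, eps=bn_eps, momentum=bn_momentum)
        self.relu = nn.ReLU(inplace=False)   # out-of-place ReLU
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1,
                               bias=False)
        self.bn2 = norm_layer(planes, eps=bn_eps, momentum=bn_momentum)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample is not None:
            identity = self.downsample(x)
        out = out + identity                 # avoid the in-place `out += identity`
        return self.relu(out)
```

Written this way, the only tensors BatchNorm mutates are its own running-statistics buffers; none of the activations that autograd saves for the backward pass are overwritten.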
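For completeness, this is roughly what the SyncBatchNorm route mentioned above would look like if training were moved to GPUs with the NCCL backend; it does not apply to the CPU/gloo run here, and `build_model()` is a hypothetical placeholder for the actual model constructor:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes one GPU per process and an NCCL process group (launched e.g. with
# torchrun); SyncBatchNorm layers do not run on the CPU/gloo setup above.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)  # build_model() is a hypothetical constructor
# Swap every BatchNorm*d module in the model for SyncBatchNorm in one call.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])
```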