Deep-learning image classification program gets killed by the system

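Hello, Teacher Rong. I prepared more than 20,000 industry images (128*128), divided into 9 categories, and used the cnn_mnist_modern.py program from Chapter 4 of "Deep Learning" (《深度学习》) to classify them. After training for a full day and night, the command line showed "Killed". I ran it on a server with 16 CPUs and 16 GB of memory. Is there a good way to solve this problem? Thanks!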
 
The detailed output is below:
['categories', 'use_gray', 'trainimg', 'imgsize', 'trainlabel', 'testimg', 'testlabel']
15516 TRAIN IMAGES
16384 DIMENSIONAL INPUT
9 CLASSES
[128 128]
['Map' 'Cross' 'Seismic' 'Log' 'Plot' 'Photograph' 'History' 'Schema'
 'Other']
NETWORK READY

2017-12-03 21:06:17.050455: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-12-03 21:06:17.050629: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
FUNCTIONS READY
=================== TRAINABLE VARIABLES ===================
[0/8] [conv1/weights:0] / SAHPE IS (5, 5, 1, 32)
[1/8] [conv1/BatchNorm/beta:0] / SAHPE IS (32,)
[2/8] [conv2/weights:0] / SAHPE IS (5, 5, 1, 64)
[3/8] [conv2/BatchNorm/beta:0] / SAHPE IS (64,)
[4/8] [fc4/weights:0] / SAHPE IS (262144, 1024)
[5/8] [fc4/BatchNorm/beta:0] / SAHPE IS (1024,)
[6/8] [fco/weights:0] / SAHPE IS (1024, 9)
[7/8] [fco/biases:0] / SAHPE IS (9,)
SAVER READY
after creating savers at  Sun Dec  3 21:06:37 2017
Epoch 0  time is  Sun Dec  3 21:06:37 2017
Epoch 1  time is  Sun Dec  3 21:52:29 2017
time spent is  5541.511075735092
currrent time is  Sun Dec  3 22:38:59 2017
Epoch: 002/050 cost: 0.381516381
 TRAIN ACCURACY: 0.53800
Computing accuracy on the validation set in 77 batches
 Accuracy on the validation set: 0.46753
 BEST EPOCH UPDATED!! [1] 
Epoch 2  time is  Sun Dec  3 22:41:36 2017
Epoch 3  time is  Sun Dec  3 23:26:44 2017
time spent is  5570.419316530228
currrent time is  Mon Dec  4 00:11:49 2017
Epoch: 004/050 cost: 0.065342006
 TRAIN ACCURACY: 0.63600
Computing accuracy on the validation set in 77 batches
 Accuracy on the validation set: 0.58338
 [/home/test/DeepLearningCourseCodes-master/04_CNN_advances/data/nets/myfield128net-.ckpt] SAVED.
 BEST EPOCH UPDATED!! [3] 
 
......
Epoch 10  time is  Mon Dec  4 04:52:05 2017
Epoch 11  time is  Mon Dec  4 05:36:16 2017
time spent is  5459.990690231323
currrent time is  Mon Dec  4 06:20:32 2017
Epoch: 012/050 cost: 0.013094623
 TRAIN ACCURACY: 0.67400
Computing accuracy on the validation set in 77 batches
 Accuracy on the validation set: 0.59844
 [/home/test/DeepLearningCourseCodes-master/04_CNN_advances/data/nets/myfield128net-.ckpt] SAVED.
Epoch 12  time is  Mon Dec  4 06:23:28 2017
Epoch 13  time is  Mon Dec  4 07:07:38 2017
time spent is  5521.78660941124
currrent time is  Mon Dec  4 07:52:34 2017
Epoch: 014/050 cost: 0.013434445
 TRAIN ACCURACY: 0.89600
Computing accuracy on the validation set in 77 batches
 Accuracy on the validation set: 0.77143
 BEST EPOCH UPDATED!! [13] 
Epoch 14  time is  Mon Dec  4 07:54:57 2017
Epoch 15  time is  Mon Dec  4 08:39:46 2017
time spent is  5504.004289388657
currrent time is  Mon Dec  4 09:24:18 2017
Epoch: 016/050 cost: 0.010157777
 TRAIN ACCURACY: 0.93400
Computing accuracy on the validation set in 77 batches
 Accuracy on the validation set: 0.76494
 [/home/test/DeepLearningCourseCodes-master/04_CNN_advances/data/nets/myfield128net-.ckpt] SAVED.
Epoch 16  time is  Mon Dec  4 09:27:00 2017
Epoch 17  time is  Mon Dec  4 10:11:53 2017
time spent is  5547.622578859329
currrent time is  Mon Dec  4 10:56:46 2017
Epoch: 018/050 cost: 0.006693343
Killed
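
For reference, the trainable-variable shapes printed in the log above can be turned into a rough memory estimate. The sketch below only counts parameters and assumes they are stored as 32-bit floats (4 bytes each); the dictionary name `shapes` and the helper `num_params` are made up for illustration. It does not account for gradients, optimizer state, activations, or the dataset held in memory, none of which are visible in the log, so it gives a lower bound and not an explanation of the kill.

```python
from functools import reduce
from operator import mul

# Shapes copied from the "TRAINABLE VARIABLES" section of the log.
shapes = {
    "conv1/weights": (5, 5, 1, 32),
    "conv1/BatchNorm/beta": (32,),
    "conv2/weights": (5, 5, 1, 64),
    "conv2/BatchNorm/beta": (64,),
    "fc4/weights": (262144, 1024),
    "fc4/BatchNorm/beta": (1024,),
    "fco/weights": (1024, 9),
    "fco/biases": (9,),
}

BYTES_PER_FLOAT32 = 4  # assumption: variables are float32


def num_params(shape):
    """Number of scalar entries in a tensor of the given shape."""
    return reduce(mul, shape, 1)


total_params = sum(num_params(s) for s in shapes.values())
fc4_bytes = num_params(shapes["fc4/weights"]) * BYTES_PER_FLOAT32
total_bytes = total_params * BYTES_PER_FLOAT32

for name, shape in shapes.items():
    print("%-22s %12d params" % (name, num_params(shape)))
print("total parameters: %d" % total_params)
print("fc4/weights alone: %.2f GiB" % (fc4_bytes / float(2 ** 30)))
print("all variables:     %.2f GiB" % (total_bytes / float(2 ** 30)))
```

Almost all of the parameters sit in fc4/weights, whose input size of 262144 comes from flattening the feature maps of 128*128 images. If the optimizer keeps extra per-parameter copies, that layer's footprint would be multiplied accordingly, but which optimizer the script uses is not shown here.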

CarlWu

Upvoted by: wangxiaolei fish

I searched online and found that this may be caused by the cache gradually growing until Ubuntu's oom-killer kills some processes. For details see http://blog.csdn.net/jisuanji_wjfioj/article/details/42420783 . Following that article, I ran the command below, where 3433 is my Python PID:

    echo -29 /proc/3433/oom_adj

I will see whether it can finish all 50 epochs over the next couple of days.

In the file /var/log/kern.log, the following entries can be found:

    Dec  4 19:29:30 ubuntu kernel: [ 5468.588483] Out of memory: Kill process 3803 (python) score 910 or sacrifice child
    Dec  4 19:29:30 ubuntu kernel: [ 5468.588996] Killed process 3803 (python) total-vm:80654152kB, anon-rss:14206004kB, file-rss:0kB, shmem-rss:0kB
    Dec  4 19:29:30 ubuntu kernel: [ 5469.348808] oom_reaper: reaped process 3803 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
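
To make the kernel message easier to read, here is a small sketch that pulls the memory fields out of the "Killed process" line quoted above and converts them to GiB. The function names `parse_oom_fields` and `kb_to_gib` are made up for illustration, and the conversion assumes the kernel's "kB" means units of 1024 bytes.

```python
import re

# The "Killed process" line copied from /var/log/kern.log above.
OOM_LINE = ("Killed process 3803 (python) total-vm:80654152kB, "
            "anon-rss:14206004kB, file-rss:0kB, shmem-rss:0kB")


def parse_oom_fields(line):
    """Return a dict such as {'anon-rss': 14206004, ...} with values in kB."""
    pairs = re.findall(r"([a-z-]+):(\d+)kB", line)
    return {name: int(kb) for name, kb in pairs}


def kb_to_gib(kb):
    """Convert kB (assumed to be 1024-byte units) to GiB."""
    return kb / (1024.0 * 1024.0)


fields = parse_oom_fields(OOM_LINE)
for name in sorted(fields):
    print("%-10s %10d kB = %6.2f GiB" % (name, fields[name], kb_to_gib(fields[name])))
```

Read this way, anon-rss is the process's own resident (non-file) memory at the moment it was killed, and it comes to roughly 13.5 GiB on a machine with 16 GB of RAM. That suggests the Python process itself, and not only the file cache, had grown close to the machine's memory.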

CarlWu


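This morning I came in to check the result and found that the Ubuntu virtual machine had crashed and shut down completely. Has anyone run into the same problem? Thanks in advance!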
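
One way to narrow this down might be to log the training process's resident memory once per epoch and check whether it keeps climbing. The sketch below reads the VmRSS field from /proc/<pid>/status on Linux. The helper names `parse_vmrss_kb` and `current_rss_kb` are made up for illustration, and where to call them in cnn_mnist_modern.py is an assumption, since the training loop is not shown in this thread.

```python
import os


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    14206004 kB".
            return int(line.split()[1])
    return None


def current_rss_kb(pid=None):
    """Read the resident set size of a process (default: this one) on Linux."""
    pid = os.getpid() if pid is None else pid
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())


# Hypothetical usage inside the training loop, once per epoch:
#     print("Epoch %d  RSS: %.2f GiB" % (epoch, current_rss_kb() / (1024.0 * 1024.0)))
```

If the printed value rises steadily from epoch to epoch, the growth comes from the process itself, and changing the OOM settings would only postpone the failure.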
