4.3.1 Availability of data

With the method described above, one relies only on labeled data for training. However, labeled data is often scarce, and for many problems it is difficult to obtain enough examples to fit the parameters of a complex model. For example, given the high expressive power of deep networks, training on insufficient data would result in overfitting.

4.3.2 Local optima

Training a shallow network (with one hidden layer) using supervised learning usually results in the parameters converging to reasonable values, but this works much less well when training a deep network. In particular, training a neural network using supervised learning involves solving a highly non-convex optimization problem (say, minimizing the training error as a function of the network parameters W). In a deep network, this problem turns out to be rife with bad local optima, and training with gradient descent (or methods like conjugate gradient and L-BFGS) no longer works well.

4.3.3 Diffusion of gradients

There is an additional technical reason, pertaining to the gradients becoming very small, that explains why gradient descent (and related methods like L-BFGS) does not work well on deep networks with randomly initialized weights. Specifically, when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small. Thus, when using gradient descent, the weights of the earlier layers change slowly, and the earlier layers fail to learn much. This problem is often called the "diffusion of gradients."
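To make the effect concrete, the following minimal sketch (not from the original text; the depth, layer width, and initialization scale are illustrative assumptions) pushes an error signal backwards through a chain of randomly initialized sigmoid layers and prints the norm of the gradient that reaches each layer's weights. The norms shrink rapidly as the signal moves toward the earlier layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
depth, width = 10, 50
# Small random weights, as in a typical random initialization.
Ws = [rng.normal(scale=0.1, size=(width, width)) for _ in range(depth)]

# Forward pass, keeping each layer's activation for backpropagation.
a = rng.normal(size=(width, 1))
acts = [a]
for W in Ws:
    a = sigmoid(W @ a)
    acts.append(a)

# Backward pass: delta_{k-1} = (W_k^T delta_k) * sigmoid'(z_{k-1}),
# using sigmoid'(z) = a * (1 - a).  Start from an error signal at the output.
delta = acts[-1] * (1 - acts[-1])
for k in reversed(range(depth)):
    grad_W = delta @ acts[k].T          # derivative of the cost w.r.t. layer k's weights
    print(f"layer {k + 1:2d}: ||grad|| = {np.linalg.norm(grad_W):.2e}")
    if k > 0:
        delta = (Ws[k].T @ delta) * acts[k] * (1 - acts[k])
```

Running this prints gradient norms that decrease by roughly an order of magnitude every couple of layers, which is exactly the behavior that leaves the earliest layers nearly untrained.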

A closely related problem to the diffusion of gradients is that if the last few layers in a neural network have a large enough number of neurons, it may be possible for them to model the labeled data alone without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).

4.4 Greedy layer-wise training

How can we train a deep network? One method that has seen some success is the greedy layer-wise training method. We describe this method in detail in later sections, but briefly, the main idea is to train the layers of the network one at a time, so that we first train a network with 1 hidden layer, and only after that is done, train a network with 2 hidden layers, and so on. At each step, we take the old network with k − 1 hidden layers and add an additional k-th hidden layer (that takes as input the hidden layer k − 1 that we had just trained). The training can be supervised (say, with classification error as the objective function on each step), but more frequently it is unsupervised (as in an autoencoder; details to be provided later). The weights from training the layers individually are then used to initialize the weights in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error). A sketch of this procedure appears below.
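The following is a minimal sketch of the unsupervised variant of this idea, assuming NumPy; the one-hidden-layer autoencoder routine, the random data, the layer sizes, and all hyperparameters are illustrative stand-ins rather than the tutorial's own code. Each layer is trained as an autoencoder on the representation produced by the layers below it, and the resulting encoder weights would then initialize the deep network before supervised fine-tuning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, hidden, lr=0.1, epochs=200, seed=0):
    """Train a 1-hidden-layer autoencoder on X (n_samples x n_features) with a
    squared-error reconstruction loss; return the learned encoder weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))    # encoder
    W2 = rng.normal(scale=0.1, size=(hidden, d))    # decoder
    for _ in range(epochs):
        H = sigmoid(X @ W1)                         # hidden code
        R = H @ W2                                  # linear reconstruction
        err = R - X
        grad_W2 = H.T @ err / n
        grad_H = err @ W2.T * H * (1 - H)
        grad_W1 = X.T @ grad_H / n
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1

# Unlabeled data (random here, only to keep the sketch self-contained).
rng = np.random.default_rng(1)
X = rng.random((500, 64))

# Greedy layer-wise pretraining: train layer k on the output of layer k - 1.
layer_sizes = [32, 16, 8]
pretrained, rep = [], X
for hidden in layer_sizes:
    W = train_autoencoder(rep, hidden)
    pretrained.append(W)
    rep = sigmoid(rep @ W)      # representation fed to the next layer

# `pretrained` now holds initial weights for the deep network's hidden layers;
# a supervised output layer would be added on top and the whole stack fine-tuned
# on the labeled training set.
print([W.shape for W in pretrained])
```

Each call to train_autoencoder sees only the representation computed by the already-trained layers beneath it, which is what makes the procedure "greedy": no layer's weights are revisited until the final fine-tuning stage.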

The success of greedy layer-wise training has been attributed to a number of factors: 

4.4.1 Availability of data

While labeled data can be expensive to obtain, unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. By using unlabeled data to learn a good initial value for the weights in all the layers (except for the final classification layer that maps to the outputs/predictions), our algorithm is able to learn and discover patterns from far more data than purely supervised approaches. This often results in much better classifiers being learned.
