Tip: When approaching a dataset, the first thing to do is to look at the data itself and observe its properties. While the techniques here apply generally, you might want to do certain things differently for your dataset. For example, one standard preprocessing trick is to subtract the mean of each data point from itself (also known as removing DC, local mean subtraction, or subtractive normalization). While this makes sense for data such as natural images, it is less obviously appropriate for data where stationarity does not hold.

3.2 Data Normalization

A standard first step in data preprocessing is data normalization. While there are a few possible approaches, the appropriate choice is usually clear from the data. The common methods for feature normalization are:

Simple Rescaling 

Per-example mean subtraction (a.k.a. remove DC)

Feature Standardization (zero-mean and unit variance for each feature across the dataset) 

3.2.1 Simple Rescaling

In simple rescaling, our goal is to rescale the data along each data dimension (possibly independently) so that the final data vectors lie in the range [0,1] or [−1,1] (depending on your dataset). This is useful for later processing because many default parameters (e.g., epsilon in PCA-whitening) assume the data has been scaled to a reasonable range.

Example: When processing natural images, we often obtain pixel values in the range [0,255]. It is a common operation to rescale these values to [0,1] by dividing the data by 255.
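
The rescaling described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `rescale_to_unit_range` and `rescale_per_dimension` are hypothetical, and the layout (rows are examples, columns are data dimensions) is an assumption.

```python
import numpy as np

def rescale_to_unit_range(pixels, max_value=255.0):
    """Map values in [0, max_value] to [0, 1] by dividing by max_value."""
    return np.asarray(pixels, dtype=np.float64) / max_value

def rescale_per_dimension(data, low=0.0, high=1.0):
    """Min-max rescale each column (data dimension) independently to [low, high].

    Assumes rows are examples and columns are dimensions.
    """
    data = np.asarray(data, dtype=np.float64)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Guard against constant columns to avoid division by zero.
    col_range = np.where(col_range == 0, 1.0, col_range)
    return low + (data - col_min) / col_range * (high - low)

# 8-bit image values in [0, 255] rescaled to [0, 1].
image = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)
scaled = rescale_to_unit_range(image)
```

Dividing by a fixed constant keeps the relative brightness between images intact, whereas the per-dimension variant stretches each dimension to the full target range; which one fits depends on the dataset.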

3.2.2 Per-example mean subtraction

If your data is stationary (i.e., the statistics for each data dimension follow the same distribution), then you might want to consider subtracting the mean value of each example (computed per example).

Example: In images, this normalization has the property of removing the average brightness (intensity) of the data point. In many cases, we are not interested in the illumination conditions of the image, but more so in the content; removing the average pixel value per data point makes sense here. Note: While this method is generally used for images, more care is needed when applying it to color images. In particular, the stationarity property does not generally hold across pixels in different color channels.
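
As a rough sketch of per-example mean subtraction, assuming grayscale patches flattened into the rows of a NumPy array (the name `remove_dc` and the sample values are illustrative only):

```python
import numpy as np

def remove_dc(patches):
    """Subtract each example's own mean from that example.

    Assumes rows are examples (e.g. flattened grayscale patches)
    and columns are pixels.
    """
    patches = np.asarray(patches, dtype=np.float64)
    # keepdims=True keeps the per-row means as a column vector so that
    # broadcasting subtracts one mean per example.
    return patches - patches.mean(axis=1, keepdims=True)

# Two patches showing the same pattern under different illumination.
patches = np.array([[0.2, 0.4, 0.6],
                    [0.5, 0.7, 0.9]])
centered = remove_dc(patches)
```

After the subtraction both rows are identical, which illustrates how the average brightness is removed while the content is preserved.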

3.2.3 Feature Standardization

Feature standardization refers to (independently) setting each dimension of the data to have zero mean and unit variance. This is the most common method for normalization and is widely used (e.g., when working with SVMs, feature standardization is often recommended as a preprocessing step). In practice, one achieves this by first computing the mean of each dimension (across the dataset) and subtracting it from that dimension. Next, each dimension is divided by its standard deviation.

Example: When working with audio data, it is common to use MFCCs as the data representation. However, the first component (representing the DC) of the MFCC features often overshadows the other components. Thus, one method to restore balance to the components is to standardize the values in each component independently.
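
The two steps of feature standardization can be sketched as follows. This is a minimal example under stated assumptions: rows are examples, columns are features, the name `standardize_features` is hypothetical, and the small `eps` added to the standard deviation is an extra safeguard against constant features that is not part of the description above. The synthetic data merely mimics one component that is much larger than the others.

```python
import numpy as np

def standardize_features(data, eps=1e-8):
    """Give each column zero mean and (approximately) unit variance.

    Returns the standardized data together with the per-feature mean and
    standard deviation so the same transform can be applied to new data.
    `eps` is an assumption added here to avoid division by zero.
    """
    data = np.asarray(data, dtype=np.float64)
    mean = data.mean(axis=0)   # mean of each dimension across the dataset
    std = data.std(axis=0)     # standard deviation of each dimension
    return (data - mean) / (std + eps), mean, std

# Synthetic features where the first component dominates in magnitude.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3)) * np.array([100.0, 1.0, 0.1])
standardized, mean, std = standardize_features(features)
```

The statistics are computed on the training set; when new data arrives, the stored `mean` and `std` would typically be reused rather than recomputed.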

3.3 PCA/ZCA Whitening

After the simple normalizations, whitening is often the next preprocessing step, and it helps make our algorithms work better. In practice, many deep learning algorithms rely on whitening to learn good features.

In performing PCA/ZCA whitening, it is pertinent to first zero-mean the features (across the dataset) to ensure that $\frac{1}{m} \sum_{i} x^{(i)} = 0$, where $x^{(i)}$ is the $i$-th of $m$ examples. Specifically, this should be done before computing the covariance matrix. (The only exception is when per-example mean subtraction is performed and the data is stationary across dimensions / pixels.)
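
To make the order of operations concrete, here is a hedged sketch of PCA and ZCA whitening in NumPy. It assumes rows are examples and columns are features; the name `whiten` and the default `epsilon=1e-5` are illustrative choices, not values prescribed by this text. The features are zero-meaned before the covariance matrix is computed, and epsilon is added to the eigenvalues as a regularization term.

```python
import numpy as np

def whiten(data, epsilon=1e-5):
    """Return (pca_whitened, zca_whitened) versions of `data`.

    Assumes rows are examples and columns are features. The default
    epsilon is an arbitrary illustrative value.
    """
    data = np.asarray(data, dtype=np.float64)
    # Step 1: zero-mean each feature across the dataset.
    centered = data - data.mean(axis=0)
    # Step 2: covariance matrix of the zero-mean data.
    cov = centered.T @ centered / centered.shape[0]
    # Step 3: eigendecomposition (eigh is appropriate for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: rotate onto the principal axes and rescale each one;
    # epsilon regularizes directions with very small variance.
    scale = 1.0 / np.sqrt(eigvals + epsilon)
    pca_whitened = (centered @ eigvecs) * scale
    # Step 5: rotate back to the original axes to obtain ZCA whitening.
    zca_whitened = pca_whitened @ eigvecs.T
    return pca_whitened, zca_whitened

# Correlated synthetic data for illustration.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.3]])
samples = rng.normal(size=(500, 3)) @ mixing
pca_white, zca_white = whiten(samples)
```

With a small epsilon, the covariance of either output is close to the identity matrix; larger values of epsilon damp the low-variance directions more strongly.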

Next, one needs to select the value of epsilon to use when performing PCA/ZCA whitening (recall that this is the regularization term that has the effect of low-pass filtering the data). It turns out that selecting this value can also play an important role in feature learning. We discuss two cases for selecting epsilon:
