TY - GEN
T1 - Prediction performance improvement for highly imbalanced monitoring data
AU - Li, Yuhua
AU - Maguire, Liam
AU - McCann, Michael
AU - Johnston, Adrian
PY - 2010
Y1 - 2010
N2 - In engineering applications, we often face highly imbalanced data problems, where the majority of the data come from one condition and a small minority come from others. Learning a classifier directly on such data is prone to performance biased toward the majority class, resulting in poor prediction on the minority class. This paper proposes a method for balancing training data by over-sampling the minority class. The method uses between-class and within-class information to decide the vicinity space of an example. It generates synthetic examples along orthogonal directions in the vicinity, which ensures that the generated synthetic examples represent the entire vicinity space well and are more similar to the minority class than to the majority class. The method is easy to use, as it involves no parameter setting. A real-world problem of semiconductor manufacturing line monitoring and process control data is used to demonstrate that classification performance can be significantly improved by learning on data balanced by the proposed method.
AB - In engineering applications, we often face highly imbalanced data problems, where the majority of the data come from one condition and a small minority come from others. Learning a classifier directly on such data is prone to performance biased toward the majority class, resulting in poor prediction on the minority class. This paper proposes a method for balancing training data by over-sampling the minority class. The method uses between-class and within-class information to decide the vicinity space of an example. It generates synthetic examples along orthogonal directions in the vicinity, which ensures that the generated synthetic examples represent the entire vicinity space well and are more similar to the minority class than to the majority class. The method is easy to use, as it involves no parameter setting. A real-world problem of semiconductor manufacturing line monitoring and process control data is used to demonstrate that classification performance can be significantly improved by learning on data balanced by the proposed method.
UR - http://www.scopus.com/inward/record.url?scp=84905739015&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84905739015
SN - 9781618390134
T3 - 7th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies 2010, CM 2010/MFPT 2010
SP - 169
EP - 176
BT - 7th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies 2010, CM 2010/MFPT 2010
PB - British Institute of Non-Destructive Testing
T2 - 7th International Conference on Condition Monitoring and Machinery Failure Prevention Technologies 2010, CM 2010/MFPT 2010
Y2 - 22 June 2010 through 24 June 2010
ER -