Filtering Image-based Spam Using Multifractal Analysis and Active Learning Feedback-Driven Semi-Supervised Support Vector Machine
Abstract—Traditional anti-spam technologies cannot block image-based spam, because spammers employ a variety of image creation and randomization algorithms that keep the message fully legible to the human eye but unrecognizable to most anti-spam engines. In this paper we propose a novel composite method to filter image-based spam accurately and effectively, which can easily be implemented as a SpamAssassin plug-in. Our method takes advantage of two properties of image-based spam: large quantity and similarity on the one hand, and character variability on the other. For the first property, we use SpamAssassin rules to detect the characteristics of an email. If the rules identify a new email as spam, it is blocked. Otherwise, any image-based email is captured by the plug-in. For the second property, the plug-in performs four steps. It applies multifractal analysis within a multi-orientation wavelet pyramid algorithm to obtain a texture descriptor of the image-based email that is strongly invariant to many factors. It uses a hybrid filter-wrapper feature subset selection algorithm based on particle swarm optimization to remove redundant or irrelevant features from the texture descriptor. It uses a semi-supervised support vector machine (SVM) classification algorithm to decide whether an email is ham or spam. Finally, it uses active learning clustering to select the most representative emails for relabeling through user feedback. The emails relabeled through user feedback and the unlabeled emails that the SVM suspects to be spam are then used to retrain the classifier, improving the accuracy of the spam filter. The experimental results demonstrate that our method achieves high efficiency, high accuracy, and a low false positive rate. Accuracy improves and the false positive rate falls as retraining is repeated.
The method is therefore especially well suited to adversarial learning tasks such as spam filtering.
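The active learning step described in the abstract, choosing the most representative emails for relabeling, could be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes plain k-means as the clustering method (the abstract does not name one), and the function names, the two-dimensional feature vectors, and the choice of k are all hypothetical.

```python
# Sketch only: pick one representative email per cluster for user relabeling.
# Assumption: each suspect email is already reduced to a numeric feature vector
# (in the paper this would be the selected texture-descriptor features).

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


def select_representatives(points, k):
    """Return the index of the point closest to each cluster centroid."""
    centroids, assignment = kmeans(points, k)
    representatives = []
    for j, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == j]
        if members:
            representatives.append(
                min(members, key=lambda i: squared_distance(points[i], centroid))
            )
    return sorted(representatives)


# Hypothetical feature vectors for six suspect emails forming two groups.
suspects = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
picked = select_representatives(suspects, k=2)
print(picked)  # indices of the emails to show the user for relabeling
```

Asking the user about only the email nearest each centroid keeps the labeling effort small; the answers, together with the emails the classifier still suspects, would then form the retraining set.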