Project information

  • Affiliation: University of California, Berkeley
  • Project Title: Malware Detection on Highly Imbalanced Data through Sequence Modeling

About the Project

In real-world settings, malware is rare compared to harmless applications. In this project, we explored the task of Android malware detection on a highly imbalanced dataset.

Detection is based on dynamic analysis of the sequences of API calls that applications make to the OS, using deep learning techniques. We show that analyzing the sequence of these activities is informative for detecting malware. Our dataset has more than 180,000 samples, two-thirds of which are malware, and is significantly larger than the datasets used in previous studies. To mimic real-world conditions, we randomly sample a small portion of the malware samples. Using the state-of-the-art model BERT, we show that it is possible to achieve the desired malware detection performance on an extremely imbalanced dataset. Our BERT-based model achieves an F1 score of 0.919 with just 0.5% of the examples being malware, significantly outperforming current supervised (logistic regression, deep neural networks, SVM, LSTM, and decision trees) and unsupervised (clustering, autoencoders, DAGMM) methods.
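The sampling step and the F1 metric described above can be illustrated with a minimal sketch. It is not the code used in the paper: the function names `subsample_malware` and `f1_score`, the label convention (1 = malware, 0 = benign), and the fixed random seed are all assumptions made for illustration.

```python
import random


def subsample_malware(benign, malware, malware_fraction, seed=0):
    """Keep all benign samples and a random subset of malware so that
    malware makes up roughly `malware_fraction` of the result.

    Hypothetical helper: returns a list of (sample, label) pairs,
    with label 1 for malware and 0 for benign.
    """
    if not 0 < malware_fraction < 1:
        raise ValueError("malware_fraction must be in (0, 1)")
    # Solve k / (k + len(benign)) = malware_fraction for k.
    k = round(malware_fraction * len(benign) / (1 - malware_fraction))
    k = min(k, len(malware))  # cannot keep more malware than we have
    rng = random.Random(seed)
    kept = rng.sample(malware, k)
    return [(x, 0) for x in benign] + [(x, 1) for x in kept]


def f1_score(y_true, y_pred):
    """F1 for the malware class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 is a natural metric here because accuracy is uninformative at this imbalance: a classifier that labels every sample benign would be 99.5% accurate at a 0.5% malware rate, yet its F1 for the malware class would be 0.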

Publication:

Malware Detection on Highly Imbalanced Data through Sequence Modeling

Rajvardhan Oak, Min Du, David Yan, Harshvardhan Takawale and Idan Amit
Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security, Nov. 2019