Malware Collection


  1. We collected malware samples from security blogs of the following companies:

    The source code of our blog crawler is available on GitHub.

  2. We further categorized the identified blog posts into:
    • Google Play Malware - posts that describe Android malware that penetrated the Google Play store
    • Non-Google Play Malware - posts that describe Android malware from alternative markets
    • Non-Android Malware - posts that describe malware for systems other than Android (e.g., iOS, PC)
    • Different Language - posts that are not in English
    • Technology/News/Promotions - posts that describe current technologies, trends, or product promotions

    The table below lists our categorization results for all 6,377 posts that we identified:

    Category                   2016   2017   2018   2019   2020   2021  Total
    Google Play Malware          56     96     51     48     24     39    314
    Non-Google Play Malware      93     76     67     44     52     24    356
    Non-Android Malware         212    284    238    237    221    112  1,304
    Different Language           10     24     17     21    122     71    265
    News/Promotions             660    798    742    778    748    412  4,138
    All                       1,031  1,278  1,115  1,128  1,167    658  6,377

    The full list of blog posts, with the category assigned to each, can be found in the “Blog Categorization” sheet of this Excel file.

  3. We identified malware samples and their families based on the information in the blog posts. A post can describe one or more families, and two separate posts can describe the same family. We marked families as duplicates if one post references the other directly or if both posts describe indicators pointing to the same apps.

    The table below lists the unique and duplicated families we identified:

    Category                   2016   2017   2018   2019   2020   2021  Total
    Identified Families          30     71     49     39     19     27    235
    Duplicate Families            3     12      8      5      3     20     51
    Unique Families              27     59     41     34     16      7    184

    The full list of the identified families can be found in the “Malware Families” sheet of this Excel file.
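
    The duplicate rule in step 3 can be sketched as a small union-find over family mentions. This is only an illustrative sketch, not the procedure we actually ran: the function name, the input layout (one entry per family mention, with hypothetical "references" and "indicators" keys), and the mention IDs are all assumptions made for the example.

```python
from collections import defaultdict


def group_duplicate_families(mentions):
    """Group family mentions that likely refer to the same family.

    `mentions` maps a mention ID to a dict with two hypothetical keys:
      "references": IDs of other mentions whose post is linked directly
      "indicators": set of indicators (e.g. hashes, package names)
    """
    parent = {mid: mid for mid in mentions}

    def find(x):
        # Follow parent links to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Rule 1: one post references the other directly.
    for mid, mention in mentions.items():
        for ref in mention.get("references", ()):
            if ref in mentions:
                union(mid, ref)

    # Rule 2: both posts describe at least one common indicator.
    first_seen = {}
    for mid, mention in mentions.items():
        for indicator in mention.get("indicators", ()):
            if indicator in first_seen:
                union(mid, first_seen[indicator])
            else:
                first_seen[indicator] = mid

    groups = defaultdict(list)
    for mid in mentions:
        groups[find(mid)].append(mid)
    return sorted(sorted(group) for group in groups.values())
```

    Under these assumptions, each returned group stands for one unique family; a group with more than one member contains duplicates.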

  4. We searched malware repositories (e.g., VirusTotal, VirusShare, Contagio), Android alternative markets (e.g., APKMonk, APKPure), and Android app repositories (e.g., AndroZoo) for samples using indicators described in the posts.

    The table below gives the number of families with detected samples and the total number of detected samples for each year:

    Category                   2016   2017   2018   2019   2020   2021  Total
    Found Families               20     47     28     21     15      3    134
    Found Samples                89    636    301    166     35     11  1,238
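
    As a rough illustration of the indicator-based search in step 4, the sketch below filters a repository index by the hashes and package names taken from a post. It is a minimal sketch under stated assumptions: the function name is hypothetical, and the index is assumed to be a CSV with `sha256` and `pkg_name` columns (similar in spirit to the app lists some repositories publish). It does not call any repository API.

```python
import csv
import io


def find_samples(index_csv, hashes=(), package_names=()):
    """Return index rows whose SHA-256 or package name matches an indicator.

    `index_csv` is any iterable of CSV lines (an open file, io.StringIO, ...)
    with a header row containing the assumed columns `sha256` and `pkg_name`.
    """
    wanted_hashes = {h.lower() for h in hashes}
    wanted_packages = set(package_names)
    matches = []
    for row in csv.DictReader(index_csv):
        if (row["sha256"].lower() in wanted_hashes
                or row["pkg_name"] in wanted_packages):
            matches.append(row)
    return matches


# Tiny in-memory index with made-up entries, for illustration only.
example_index = io.StringIO(
    "sha256,pkg_name\n"
    "AAA,com.example.one\n"
    "bbb,com.example.two\n"
    "ccc,com.example.three\n"
)
found = find_samples(example_index,
                     hashes=["aaa"],
                     package_names=["com.example.three"])
```

    Hash comparison is case-insensitive because posts and indexes do not always agree on letter case; package names are compared exactly.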

  5. We manually analyzed one sample from each of 105 distinct families. We could not analyze samples from the remaining families because of packers, reflection, obfuscation, and similar obstacles. The next two tables show the per-year distribution of the analyzed samples and of the samples we could not analyze, together with the underlying reasons:

    Category                           2016   2017   2018   2019   2020   2021  Total
    Samples Analyzed                     13     34     27     17     11      4    105

    Category                           2016   2017   2018   2019   2020   2021  Total
    Packer                                2      3      0      3      0      1      9
    Reflection                            0      4      0      1      0      0      5
    Cannot Find Malicious Behavior        2      2      0      0      0      0      4
    No Decryption Key                     0      0      1      0      0      1      2
    Obfuscation                           3      4      0      0      2      0      9
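
    The totals in steps 4 and 5 can be cross-checked with a few lines of arithmetic. The sketch below uses only the "Total" column of the tables and assumes that every family with found samples is either analyzed or listed under exactly one reason.

```python
# Totals copied from the "Total" column of the tables in this section.
found_families = 134
not_analyzed = {
    "Packer": 9,
    "Reflection": 5,
    "Cannot Find Malicious Behavior": 4,
    "No Decryption Key": 2,
    "Obfuscation": 9,
}

# Families we could not analyze, summed over all reasons.
total_not_analyzed = sum(not_analyzed.values())

# Remaining families, i.e. those with one manually analyzed sample.
analyzed = found_families - total_not_analyzed

print(total_not_analyzed, analyzed)  # prints "29 105"
```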