These are some development notes I took a couple of weeks back while looking at this with limited ARM assembly knowledge. Most of the work is from CreepNT; someone just needs to complete it.

Firstly, there are MTE Intrinsics in arm_acle.h, making the amount of assembly we have to write effectively nil.

Let's talk about improvements!

get_random_tagged_pointer(ptr) can be replaced by __arm_mte_create_random_tag(ptr, 1 << MEM_TAG_FREE). The IRG instruction takes a mask in which each set bit excludes the tag whose value equals that bit's position, as shown in choose_nonexcluded_tag(...). In that pseudocode, the tag parameter is the previously generated tag (RGSR_EL1.TAG) and offset is the result of random tag generation.
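To make the mask semantics concrete, here is a small software model of the tag selection as I read the Arm pseudocode. Treat it as a sketch, not the architectural definition: the names mte_exclude_bit and model_choose_nonexcluded_tag are made up for illustration, and the model never executes IRG, so it runs on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Bit N of the exclude mask removes tag N (0..15) from the candidates.
 * Hypothetical helper, not part of arm_acle.h. */
static inline uint64_t mte_exclude_bit(unsigned tag)
{
    assert(tag < 16);
    return UINT64_C(1) << tag;
}

/* Software model of choose_nonexcluded_tag (my reading of the pseudocode).
 * start_tag plays the role of RGSR_EL1.TAG, offset the random value. */
static inline unsigned model_choose_nonexcluded_tag(unsigned start_tag,
                                                    unsigned offset,
                                                    uint64_t exclude)
{
    unsigned tag = start_tag & 0xFu;

    exclude &= 0xFFFFu;
    offset &= 0xFu;

    /* Every tag excluded: the model falls back to tag 0. */
    if (exclude == 0xFFFFu)
        return 0;

    /* With no offset, only skip forward past excluded tags. */
    if (offset == 0) {
        while ((exclude >> tag) & 1u)
            tag = (tag + 1) & 0xFu;
    }

    /* Otherwise step 'offset' times, skipping excluded tags each step. */
    while (offset != 0) {
        offset--;
        tag = (tag + 1) & 0xFu;
        while ((exclude >> tag) & 1u)
            tag = (tag + 1) & 0xFu;
    }
    return tag;
}
```

For example, with only tag 0 excluded, a start tag of 0 and an offset of 0, the model skips forward to tag 1.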

This also means the do/while loop in get_tagged_pointer can be replaced by a single call to __arm_mte_create_random_tag with a mask of (u64)((1 << MEM_TAG_FREE) | (1 << adj_tag_1) | (1 << adj_tag_2)). For previously tagged allocations, where we want to increment the tag until it differs from those of the adjacent allocations, a while loop seems fine as we're not actually going into the random tag path.
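A rough sketch of both paths, under a few assumptions: exclude_mask3, next_allowed_tag and random_tagged are hypothetical helper names, tags are 4-bit values, and the intrinsic wrapper is only compiled when the toolchain defines __ARM_FEATURE_MEMORY_TAGGING (e.g. when building with +memtag). The mask and increment helpers are plain C and behave the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Build the exclude mask for the free tag and both neighbours' tags. */
static inline uint64_t exclude_mask3(unsigned free_tag,
                                     unsigned adj_tag_1,
                                     unsigned adj_tag_2)
{
    assert(free_tag < 16 && adj_tag_1 < 16 && adj_tag_2 < 16);
    return (UINT64_C(1) << free_tag) |
           (UINT64_C(1) << adj_tag_1) |
           (UINT64_C(1) << adj_tag_2);
}

/* Non-random path: bump the previous tag until it is not excluded. */
static inline unsigned next_allowed_tag(unsigned prev_tag, uint64_t exclude)
{
    unsigned tag = prev_tag;

    assert(prev_tag < 16);
    /* At least one tag must remain, or the loop could not terminate. */
    assert((exclude & 0xFFFFu) != 0xFFFFu);

    do {
        tag = (tag + 1) & 0xFu;
    } while ((exclude >> tag) & 1u);
    return tag;
}

#if defined(__ARM_FEATURE_MEMORY_TAGGING)
#include <arm_acle.h>

/* Random path: one IRG via the ACLE intrinsic, no retry loop needed. */
static inline void *random_tagged(void *ptr, uint64_t exclude)
{
    return __arm_mte_create_random_tag(ptr, exclude);
}
#endif
```

If MEM_TAG_FREE were 0 (an assumption, I have not checked the actual value) and the neighbours were tagged 3 and 5, the mask would be 0x29, and bumping a previous tag of 2 would skip 3 and land on 4.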

When storing tags over a range, instead of looping STG / STZG, the HWASAN folks have found it faster to do DC GVA / DC GZVA. This is a potential optimization we can explore in the future.

The approach taken by CreepNT when storing the tags in the slab metadata looks pretty nice too. Overall, MTE support seems pretty complete, just missing a few pieces here and there.

We want to deploy MTE as a library used by hardened_malloc rather than using the approach in the mentioned tree. CreepNT has been working on MTE recently but it's not on GitHub yet.

Here is the rewritten single-header library that wraps most (all?) MTE functionality, mentioned earlier by @flawedworld.
I'll quickly address a few points raised in this thread:

  1. Intrinsics
    When I started working on it, I don't think intrinsics were readily available, or at least I wasn't able to find them. For this reason I used raw ASM instead (which causes other issues down the line but I digress).
    I fully agree that intrinsics should be used if possible, and will take a look at them more closely.
    However, some optimizations I do in MTELib might not be possible without assembly (e.g. this); though I haven't done any profiling so maybe the gain is insignificant and this point is moot.
  2. The do/while in pointer tagging functions
    This comes from a misunderstanding on my side about how random tag generation works (I thought that all bits set in the exclude mask were excluded from the generated tag, when I should have understood that bit X in the exclude mask prevents tag X from being generated). This issue is addressed in MTELib.
  3. DC GVA/DC GZVA
    I wasn't aware those existed as they're not mentioned in the MTE whitepaper.
    From a quick glance in the AArch64 Instructions documentation, it looks like they might generate exceptions (though it looks related to hypervisor so maybe it's just hyp traps?). I will look into this more closely later.

I will try to integrate this library in hardened_malloc in a similar fashion to what I currently have on GitHub in the next few days.

    CreepNT

    it looks like they might generate exceptions (though it looks related to hypervisor so maybe it's just hyp traps?)

    From what I understand you need to read the value of DCZID_EL0.DZP before deciding to use DC GVA/DC GZVA. Hypervisors may set HCR_EL2.TDZ when supporting operating systems with differing block sizes, but don't quote me on that :)
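A minimal sketch of that check, assuming the DCZID_EL0 layout as I understand it from the Arm documentation: bit 4 is DZP and bits 3:0 are BS, the log2 of the block size in 4-byte words. The struct and function names are hypothetical, and the register read is only compiled on AArch64; the decode itself is plain C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dczid_info {
    bool prohibited;    /* DZP set: do not use DC ZVA / DC GVA / DC GZVA */
    size_t block_bytes; /* size of the block one DC operation covers */
};

/* Decode a raw DCZID_EL0 value (assumed layout: DZP = bit 4, BS = bits 3:0). */
static inline struct dczid_info decode_dczid(uint64_t dczid)
{
    struct dczid_info info;

    info.prohibited = ((dczid >> 4) & 1u) != 0;
    /* BS is log2 of the block size in words, so bytes = 4 << BS. */
    info.block_bytes = (size_t)4 << (dczid & 0xFu);
    return info;
}

/* The DC operations work on naturally aligned blocks. */
static inline bool dc_block_aligned(uintptr_t addr, size_t block_bytes)
{
    /* block_bytes must be a power of two. */
    assert(block_bytes != 0 && (block_bytes & (block_bytes - 1)) == 0);
    return (addr & (block_bytes - 1)) == 0;
}

#if defined(__aarch64__)
/* Read the register from EL0; only meaningful on real AArch64 hardware. */
static inline uint64_t read_dczid_el0(void)
{
    uint64_t value;

    __asm__ volatile("mrs %0, dczid_el0" : "=r"(value));
    return value;
}
#endif
```

A raw value of 0x4 would decode to a 64-byte block with the operations permitted, while 0x14 reports the same block size but with DZP set, in which case falling back to an STG / STZG loop seems like the safe choice.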

    If you want another example of DC GZVA you can look at scudo's setTags.

    I’ve also been looking at other implementations and haven’t noticed much st2g usage though…