Writing your own RDI/sRDI loader using C and ASM

In this post, I am going to show the readers how to write their own RDI/sRDI loader in C, and then show how to optimize the code to make it fully position independent.

As a security researcher and malware developer, being able to create your own loaders using techniques like RDI/sRDI can assist in avoiding detection by security software and can increase the longevity of a given malware variant. The more unique an implant is, the more difficult it is for security software to detect and analyze, making it more effective in its intended purpose. Learning how to write your own loaders using techniques such as RDI/sRDI loading is a crucial element of maintaining a competitive edge in the field.

TL;DR

The final code example was added to my repo below, but if you are learning, I highly recommend re-typing everything as you go to get the most out of it.

GitHub - maliciousgroup/RDI-SRDI: This repo goes with the blog entry at blog.malicious.group entitled “Writing your own RDI / sRDI loader using C and ASM”.

With that being said, let's take a look at the prerequisites required to follow along with this paper.


Prerequisites

Reflective DLL Injection (RDI) and Shellcode Reflective DLL Injection (sRDI) are techniques used by attackers to load a DLL into a process from memory, without relying on the standard Windows loader. RDI was introduced by Stephen Fewer in 2008, while sRDI was released by Nick Landers in 2017. Both researchers are acknowledged for their contribution in introducing these techniques to the public.

To fully grasp the examples in this post, it is essential to have a solid understanding of the PE file format and how it is typically loaded by Windows.

The following list outlines several key steps that Windows takes when loading a PE file. These are also the same steps we will need to take when writing our RDI loader.

  1. In this step, the binary representation of the DLL file is read from the file system or from memory. This typically involves opening the file, reading its contents, and storing the binary data in memory for further processing.

  2. The PE header of the DLL file is parsed to extract important information such as the size of the image. The file starts with a DOS header, whose e_lfanew field gives the offset of the PE (NT) headers. These headers describe the organization and layout of the file, including the number and size of the different sections.

  3. In this step, memory is allocated in the address space of the target process to hold the binary data of the DLL. This is typically done using the VirtualAllocEx function, which reserves and commits a block of memory with the appropriate size to accommodate the entire DLL image. Sections are then iterated and copied to the newly allocated memory using the PIMAGE_SECTION_HEADER structure.

  4. The relocations, also known as fixups, are applied to the DLL image to adjust the addresses of the code and data within the image to match the base address at which the image is loaded in the target process. This step involves iterating through the relocation table referenced by the PE header's data directory and applying the necessary address adjustments to the corresponding locations in the allocated memory.

  5. The Import Address Table (IAT) of the DLL is processed to resolve and update the import references to external functions and data. This step involves iterating through the IAT and resolving the import references by loading the required modules into the target process, obtaining the addresses of the imported functions or data, and updating the IAT with the resolved addresses.

  6. The protection settings of the different sections in the DLL image are applied to the corresponding memory pages in the allocated memory. This involves setting the appropriate memory protection flags, such as PAGE_EXECUTE_READ for executable sections and PAGE_READWRITE for writable sections, using the VirtualProtectEx function.

  7. If the DLL has Thread Local Storage (TLS) callbacks, they are executed in the target process. TLS is a mechanism that allows each thread in a process to have its own storage for thread-specific data. TLS callbacks are functions that are executed when a thread is created or terminated, and they are typically used to initialize or clean up thread-specific data.

  8. Finally, the execution is handed over to the entry point of the DLL, which is the DllMain function. DllMain is a special function in the DLL that is automatically called by the operating system when the DLL is loaded or unloaded, and it is responsible for performing any necessary initialization or cleanup tasks specific to the DLL.

To ensure a thorough understanding of the steps mentioned above, it's crucial to imagine how each task can be executed with C code. If you're not confident about this yet or wish to gain more knowledge about the functioning of the PE format, I highly suggest taking up Xeno Kovah's OST2 courses or watching Dr. Josh Stroschein's videos on YouTube.

From this point on, I will also be covering the material with the assumption the reader has previous knowledge of at least some of the subject matter I am going to cover in the following sections.


Build Setup

For the development, I chose to utilize JetBrains' CLion as my preferred IDE. Since I had previously purchased PyCharm Professional, I already had a JetBrains account and found CLion to be a great tool for this task.

In the upcoming build process, my first step will be to create a generic DLL intended solely for testing our RDI/sRDI loaders. After creating the DLL, I will build a basic RDI loader using the standard Windows API, which will inject that test DLL. Lastly, I will create a stealthier RDI/sRDI loader featuring function name hashing, obfuscated function pointers, and the Native API exported by ntdll.dll.

For clarity, I will be breaking up the loader's code into the steps listed in the section above. I found this is the easiest way to show how the loader works with both the standard API and the Native API.


DLL Creation

My first step using CLion as the IDE will be to create a new C project, which I will name dll_poc as shown in the example below.

After creating the dll_poc project, you should see the following image which will allow you to start editing the main.c file.

The following code should go into main.c to create and export the DllMain() function within the DLL file.

main.c


To make sure we compile this into a DLL instead of an EXE, we need to modify the CMakeLists.txt file located in the project root directory. The file should look like the following.

CMakeLists.txt


Now with both files above modified, we can go ahead and build the project, which should create a dll_poc.dll file within the default cmake-build-debug directory.

With the dll_poc.dll file created, we can quickly test to make sure the DllMain function works without error by using rundll32.

With the DLL done, I am going to copy the DLL to C:/Temp/ for easy testing.

cp .\cmake-build-debug\dll_poc.dll C:\Temp\

Basic RDI Loader

With the DLL compiled and moved to C:/Temp/, it is time to start working on the loader, so again I will start by creating a new C project called dll_loader as seen in the following example.

After the new project has been created, open a terminal within the IDE (Alt + F12) and create the following folders. Which commands you run in the following steps depends on the command interpreter you are using with CLion.


Make required folders

🖥️
Powershell
New-Item -ItemType Directory -Force -Path "src\c","src\h","src\masm"

🖥️
Cmd
mkdir src\c src\h src\masm

Once the folders above have been set up, the IDE file browser should look like the following image.

Now with the directories created, it is time to populate them with a few files using the following commands.


Make required files

🖥️
Powershell
New-Item -ItemType File -Path "src/c/peb.c", "src/h/peb.h", "src/masm/peb.masm"
New-Item -ItemType File -Path "src/h/defs.h", "src/h/structs.h"

🖥️
Cmd
echo.>src\c\peb.c
echo.>src\h\peb.h
echo.>src\masm\peb.masm
echo.>src\h\defs.h
echo.>src\h\structs.h

After running the above commands to set up some blank files in their respective directories, you should see a file layout like the one in the image below.

Having created all the necessary file stubs, the next step is to create a basic DLL injection using main.c. For now, we'll use the example from ired.team which uses what I consider "high-level" Windows API functions to ensure the logic is sound. Once this is confirmed, we can move on to refining and optimizing the code.


Basic - Step 1

In this step, the binary representation of the DLL file is read from the file...

This code uses the CreateFileA function to open the DLL for reading, then allocates some space on the heap using HeapAlloc which will store the file being read by ReadFile.


Basic - Step 2

The PE header of the DLL file is parsed to extract important information...

This code uses the PIMAGE_DOS_HEADER and PIMAGE_NT_HEADERS structures to find the size of the DLL image once it is mapped into memory, which is stored in the nt_headers->OptionalHeader.SizeOfImage member variable. Note that this is not the same as the size of the file on disk.
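To make the header walk a bit more concrete, here is a minimal, portable sketch that reads SizeOfImage from a raw buffer using fixed offsets instead of the `<windows.h>` structures. The names pe_size_of_image and rd32 are my own placeholders for illustration and are not part of the loader code in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value without alignment assumptions. */
static uint32_t rd32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns SizeOfImage, or 0 if the buffer
 * does not look like a PE file. */
uint32_t pe_size_of_image(const uint8_t *buf, size_t len) {
    /* DOS header: "MZ" magic at offset 0, e_lfanew at offset 0x3C. */
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return 0;
    uint32_t e_lfanew = rd32(buf + 0x3C);

    /* Need: signature (4) + file header (20) + optional header
     * up to and including SizeOfImage (56 + 4). */
    if (e_lfanew > len || len - e_lfanew < 4 + 20 + 60)
        return 0;
    if (rd32(buf + e_lfanew) != 0x00004550u) /* "PE\0\0" */
        return 0;

    /* SizeOfImage sits 56 bytes into the optional header for
     * both PE32 and PE32+ images. */
    return rd32(buf + e_lfanew + 4 + 20 + 56);
}
```

With the Windows headers available, the same value is simply nt_headers->OptionalHeader.SizeOfImage; the sketch only spells out where that field lives in the file.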


Basic - Step 3

Memory is allocated in the address space of the target process to hold the binary...

This code is responsible for copying the sections of the DLL from the buffer that was read from disk into the memory block allocated for the DLL image. It does this by using the PIMAGE_SECTION_HEADER structure and the IMAGE_FIRST_SECTION macro to iterate over each section and copy it to the allocated memory.


Basic - Step 4

The loader performs any necessary relocation fixups on the executable. Relocation...

This code is responsible for performing relocation of the DLL image to its new base address. It first gets the relocation data directory and calculates the relocation table's RVA. It then iterates over each block in the relocation table and calculates the address to patch based on the relocation type and offset by using both the PBASE_RELOCATION_ENTRY and PBASE_RELOCATION_BLOCK structures. It reads the original value at the address to patch, adds the delta image base to it, and writes it back to the same location to update the address to the new location. This process ensures that the DLL can run correctly at its new base address.
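The relocation walk described above can be sketched in portable C against a plain byte buffer. This is only an illustration under a few assumptions: the function name apply_relocs is hypothetical, a little-endian host is assumed, and only the 64-bit DIR64 relocation type (plus ABSOLUTE padding entries) is handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REL_ABSOLUTE 0   /* padding entry, nothing to patch */
#define REL_DIR64    10  /* patch a full 64-bit address */

/* Hypothetical helper: walks the relocation blocks found at reloc_rva
 * inside an already-mapped image and adds delta to every DIR64 target.
 * Returns 0 on success, -1 on malformed or unsupported input. */
int apply_relocs(uint8_t *image, size_t image_len,
                 uint32_t reloc_rva, uint32_t reloc_size, uint64_t delta) {
    size_t pos = reloc_rva;
    size_t end = (size_t)reloc_rva + reloc_size;
    if (end > image_len)
        return -1;

    while (pos + 8 <= end) {
        uint32_t page_rva, block_size;
        memcpy(&page_rva, image + pos, 4);       /* VirtualAddress */
        memcpy(&block_size, image + pos + 4, 4); /* SizeOfBlock */
        if (block_size == 0)
            break;                               /* terminator block */
        if (block_size < 8 || block_size > end - pos)
            return -1;

        size_t count = (block_size - 8) / 2;     /* 16-bit entries */
        for (size_t i = 0; i < count; i++) {
            uint16_t entry;
            memcpy(&entry, image + pos + 8 + i * 2, 2);
            unsigned type = entry >> 12;         /* top 4 bits */
            unsigned offset = entry & 0x0FFF;    /* low 12 bits */
            if (type == REL_ABSOLUTE)
                continue;
            if (type != REL_DIR64)
                return -1;

            size_t target = (size_t)page_rva + offset;
            if (target + 8 > image_len)
                return -1;
            uint64_t value;
            memcpy(&value, image + target, 8);
            value += delta;                      /* rebase the address */
            memcpy(image + target, &value, 8);
        }
        pos += block_size;
    }
    return 0;
}
```

Here delta is the difference between the address the image was actually loaded at and the ImageBase stored in its optional header.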


Basic - Step 5

The loader performs any necessary imports. An import is a reference to a function...

This code is responsible for loading the imported functions of a DLL. It first retrieves the address of the import directory, then loops through each import descriptor, loading the library containing the imported functions and resolving each imported function by either its name or ordinal value. It then sets the address of each imported function in the appropriate thunk table entry.


Basic - Step 6

Finally, the loader transfers control to the executable's entry point...

This code snippet gets the address of the DLL entry point using the AddressOfEntryPoint field of the OptionalHeader of the loaded DLL. It then casts this address to a function pointer of type DLLEntry, which is a type that represents the entry point of a DLL. The (*DllEntry) syntax dereferences the function pointer and calls the entry point function with the HINSTANCE of the loaded DLL, the DLL_PROCESS_ATTACH flag, and a value of 0 for the third parameter. Finally, the code releases the resources allocated for loading the DLL, including closing the handle to the file and freeing the memory allocated for storing the DLL bytes.


Basic Final

Taking all the steps above and putting it together will look like the following example.


After typing out the code in main.c, build the solution and run the executable. This should work with the default CMakeLists.txt that came with the project creation.

As you can see above, this example works. However, this is a very basic example of an RDI injection without any optimizations or obfuscation, and this example would not work in most secure environments. To see why, let's take a closer look by doing a little analysis on the binary as it is now.


First let's run the binary, and instead of clicking the "Ok" button to close it, let's open Process Hacker and take a look at the memory while it is running.

As you can see in the above image, we currently have an RWX region in memory, which is a dead giveaway that something suspect is happening in the process. We will need to fix this, along with everything else, in the upcoming version.


Next, let's open the file with CFF Explorer as seen below.

Here you can see that the dll_loader.exe file is importing KERNEL32.dll with 24 functions, and msvcrt.dll with 26 functions. You may be asking yourself why the hell there are 50 functions imported when the code itself only uses a fraction of those. This is because the C runtime startup code that gets linked into the executable brings its own imports, on top of the functions our code calls directly.


Lastly, let's take a look at the binary using strings.exe from the Sysinternals tools.

C:\Temp>strings.exe C:\Maldev\evasion\dll_loader\cmake-build-debug\dll_loader.exe | find /c /v ""
1490

There are 1490 entries within the strings.exe output, and some of the items we want to remove are shown below.

All the functions we are using are easily seen in plain-text in the above output. This is a problem for any malware developer and needs to be taken care of to make this loader more stealthy.

As the examples above show, the basic loader is very loud and needs a lot of work. Now let's take a look at writing an obfuscated RDI loader using the Windows Native API, obfuscated function pointers, function name hashing and more.


Stealth Loader

In this version of the loader, we will utilize the files we generated previously to store custom structures, function definitions, and assembly code to improve the stealthiness of the RDI loader.

Using the same dll_loader project, comment out the current main() function for now so we can rewrite everything from scratch.

First, we need to create some helper functions for function hashing and function address resolution. To achieve this, we will add a CRC hashing function to the peb.c file, as well as a get_proc_address_by_hash function that resolves a function's address within a given DLL from the hash of its name.


Helper Functions

peb.c

As you can see in the code above, the crc32b function takes a string and returns a hash for storage, and the get_proc_address_by_hash function takes the base address of a DLL, and the hashed function name as input and returns the function address. We are also adding the peb.h header which includes variables required by crc32b as seen below.
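For readers who have not seen one before, a bitwise CRC-32 routine commonly looks like the sketch below. I am assuming the standard reflected polynomial 0xEDB88320 and a NUL-terminated input string here; the exact signature and constants of the crc32b function in peb.c may differ.

```c
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320) over a
 * NUL-terminated string. Sketch only, not the post's exact code. */
uint32_t crc32b(const char *str) {
    uint32_t crc = 0xFFFFFFFFu;
    for (; *str; str++) {
        crc ^= (uint8_t)*str;
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

With these parameters, hashing the string "123456789" yields 0xCBF43926, the well-known CRC-32 check value, which is a handy way to confirm an implementation is correct.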


peb.h


We will also move the structs within the old main.c to the structs.h header file as seen below.

structs.h


Next, we add some content to the defs.h file to define a few constants. This is also where all of our function definitions will go as we progress further.

defs.h


It's time to familiarize ourselves with some fundamental MASM Assembly instructions. As we'll be utilizing the Native API for our loader, it's essential to know how to obtain the address of the ntdll.dll library to resolve Native API functions when required. Although this task can also be accomplished in C, learning ASM can aid in writing C code, which is why we'll be focusing on it.

peb.masm


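The above code walks the PEB (Process Environment Block) of the current 64-bit process to find the base address of ntdll.dll. This works because the loader data referenced by the PEB lists every module mapped into the process, and ntdll.dll is normally the second entry in the in-memory-order list, right after the executable itself.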

Here are the steps of what is happening in the above code.

  1. xor rax, rax: This sets the value of the rax register to zero, which is a common way to initialize a register before using it.

  2. mov rax, gs:[60h]: This retrieves the address of the PEB. On 64-bit Windows the gs segment register points to the current thread's TEB (Thread Environment Block), and offset 60h within it holds the PEB pointer.

  3. mov rax, [rax + 18h]: This retrieves the Ldr (Loader) member of the PEB, which is a pointer to the PEB_LDR_DATA structure holding the heads of the loaded-module lists.

  4. mov rax, [rax + 20h]: This reads InMemoryOrderModuleList.Flink, which points to the list links of the first entry in the linked list. That entry corresponds to the main module of the process (i.e., the executable file itself).

  5. mov rax, [rax]: This follows that entry's Flink pointer to the second entry in the linked list, which corresponds to ntdll.dll.

  6. mov rax, [rax + 20h]: This reads the DllBase member of that entry. The in-memory-order links sit 10h bytes into the LDR_DATA_TABLE_ENTRY structure and DllBase sits at 30h, hence the 20h offset.

  7. ret: This returns the value of rax, which now contains the base address of ntdll.dll.

The instructions would be slightly different for a 32-bit application, but since I am using 64-bit I will be using the above version.

With peb.masm providing a get_ntdll() function, we need to add a line to the peb.h file declaring get_ntdll() so it can be called from C.

peb.h


As you can see above, we added line 7 so we can access the get_ntdll() MASM function from our C code.

With the above helper functions created, the last thing to do before moving on is to set up our CMakeLists.txt configuration to enable MASM compilation, as well as other features to make things easier.

CMakeLists.txt


The above configuration allows our MASM files to be assembled and linked as part of the C build. It also allows each file within the project to include headers by name only, instead of spelling out the path for each. You will notice I moved the ml64.exe binary to the C:/Temp/ directory for the sake of testing, so you can either do the same, or use the full path to where ml64.exe exists within something like Visual Studio.

Now with the CMake configuration done, we can move to the main.c file to start writing and setting up new functions and structures needed as we progress through main.c.


Within main.c, we are going to use the logic from each step in our basic loader, and rewrite it into Native API functions, with function pointer obfuscation.

The above code is Step 1 from our basic loader. It simply opens a file in read mode, gets the size of the file, allocates memory to hold the data, and reads the file into the allocated memory via the ReadFile function.

To rewrite these functions using the Native API, it's important to have a clear understanding of which functions are needed and how to set up function name hashes and function definitions for each of them. The following functions are required to replace those in the Step 1 code above.

  1. RtlInitUnicodeString - is required to create Unicode strings
  2. NtCreateFile - is required to replace CreateFileA
  3. NtQueryInformationFile - is required to get information about file
  4. NtAllocateVirtualMemory - is required to allocate memory for the file
  5. NtReadFile - is required to read file contents

For these functions to work, we will need to add some new structures to our structs.h header file. This is due to the Native API calls relying on lower-level data structures that need to be added.

  1. UNICODE_STRING
  2. OBJECT_ATTRIBUTES
  3. IO_STATUS_BLOCK
  4. FILE_STANDARD_INFORMATION
  5. PIO_APC_ROUTINE

Because this is a common theme in malware development, I am going to show you the sources I use to get the definitions for both structures and functions, and how I hash the names.

I often refer to the ntdll.dll header file used in the x64dbg project as a point of reference. Instead of including the entire file in my projects, I believe it's important to understand which functions use which structures. By adding only the necessary components, I can reduce the size of my project and potentially decrease the surface area for analysis. This approach also allows for a better understanding of the functions being used, as well as providing a more focused and streamlined development process.

The first thing I do to resolve the function definitions is pull the function prototypes from the ntdll.dll header file listed in the link above. After pulling the function prototypes, I copy them as seen in the defs.h file below.

defs.h


If you are following along, you can see why we need the various structures to support the Native API function definitions as seen in the following image.

You can see in the IDE that items like PUNICODE_STRING or PIO_STATUS_BLOCK and others are not being resolved, so this means we need to populate those structures within the structs.h file as seen below.

structs.h


Now we need to define the function name hashes for the function names we currently want to use. We can achieve this by using a small printf function along with our HASH macro and then exit as seen below.

You then take those hashes and define them in the peb.h file so we can use them in the coming code.


Function Obfuscation

Now that the function hashes and definitions are created, I will show you how I am obfuscating the function pointers, and hiding the function names in the binary.

In the following example, there are three steps to using a Native API function when using function obfuscation.

Step 1: Make sure there is a pointer to the ntdll.dll base address; if not, create it with get_ntdll(). Next, create a void pointer for the function and resolve its address using get_proc_address_by_hash with the hash we created in the previous steps, as seen below.

The above pointers now hold the base address of ntdll.dll and the address of NtCreateFile, respectively.


Step 2: Cast the function pointer to its definition type.

The above code is casting the new variable g_nt_create_file as type NtCreateFile_t, as we defined it in defs.h.


Step 3: Use the function as normal, using our new function g_nt_create_file.

The above call is identical to calling the actual NtCreateFile function, but it goes through an obfuscated function pointer.

Now that we have covered both the helper functions and how to apply function obfuscation, we can start working on Step 1 of our new RDI loader.


Stealth Step 1

In this step, the binary representation of the DLL file is read from the file...

The above code is the equivalent code to step 1 in our basic loader. This version includes Native API functions using function obfuscation and function name hashing. You can see the code went from 4 lines, to almost 40, and this doesn't count the structures and definitions we added.


Stealth Step 2

The PE header of the DLL file is parsed to extract important information...

This code uses the PIMAGE_DOS_HEADER and PIMAGE_NT_HEADERS structures to find the size of the DLL image.


Stealth Step 3

Memory is allocated in the address space of the target process to hold the binary...

The above code allocates virtual memory for the DLL image and calculates the difference between the base address of the allocated memory and the ImageBase address in the DLL's PE file header. It then copies the DLL's headers into the allocated memory and iterates through each section defined in the PE file header, copying its contents into place with memcpy().


Stealth Step 4

The loader performs any necessary relocation fixups on the executable. Relocation...

This code is responsible for performing relocation of the DLL image to its new base address. It first gets the relocation data directory and calculates the relocation table's RVA. It then iterates over each block in the relocation table and calculates the address to patch based on the relocation type and offset by using both the PBASE_RELOCATION_ENTRY and PBASE_RELOCATION_BLOCK structures. It reads the original value at the address to patch, adds the delta image base to it, and writes it back to the same location to update the address to the new location. This process ensures that the DLL can run correctly at its new base address.


Stealth Step 5

The loader performs any necessary imports. An import is a reference to a function...

The above code handles the dynamic linking of imported libraries in a Windows PE file. It iterates through the import descriptors in the PE file, loads the import libraries using the LdrLoadDll function, retrieves the addresses of imported functions using the LdrGetProcedureAddress function, and updates the import address table (IAT) with the resolved addresses. The code uses Windows-specific data types and structures, such as PIMAGE_IMPORT_DESCRIPTOR and UNICODE_STRING, and makes use of function pointers obtained through hash-based lookups to dynamically call functions from the NTDLL library, a Windows system library providing low-level functions for managing processes, threads, and memory.


Stealth Step 6

The protection settings of the different sections in the DLL image are applied...

The above code iterates through the sections of a DLL, as defined in its PE file header. For each section, it calculates the protection flags for the memory region where the section will be loaded, based on the characteristics of the section (such as executable, readable, and writable). It then calls the NtProtectVirtualMemory() function, obtained through a function pointer, to set the appropriate memory protection for the section.
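The mapping from section characteristics to page protection can be expressed as a small lookup table. The sketch below is my own illustration of that idea: the function name section_protection and the SCN_/PG_ constant names are placeholders, defined locally with the values of the corresponding IMAGE_SCN_MEM_* and PAGE_* constants from the Windows headers so that the snippet stands on its own.

```c
#include <stdint.h>

/* Values of IMAGE_SCN_MEM_EXECUTE / READ / WRITE. */
#define SCN_MEM_EXECUTE 0x20000000u
#define SCN_MEM_READ    0x40000000u
#define SCN_MEM_WRITE   0x80000000u

/* Values of the matching PAGE_* protection constants. */
#define PG_NOACCESS          0x01u
#define PG_READONLY          0x02u
#define PG_READWRITE         0x04u
#define PG_WRITECOPY         0x08u
#define PG_EXECUTE           0x10u
#define PG_EXECUTE_READ      0x20u
#define PG_EXECUTE_READWRITE 0x40u
#define PG_EXECUTE_WRITECOPY 0x80u

/* Hypothetical helper: pick a page protection from the
 * execute/read/write bits of a section's Characteristics field. */
uint32_t section_protection(uint32_t characteristics) {
    /* Indexed as table[executable][readable][writable]. */
    static const uint32_t table[2][2][2] = {
        { { PG_NOACCESS, PG_WRITECOPY },
          { PG_READONLY, PG_READWRITE } },
        { { PG_EXECUTE,      PG_EXECUTE_WRITECOPY },
          { PG_EXECUTE_READ, PG_EXECUTE_READWRITE } },
    };
    int x = (characteristics & SCN_MEM_EXECUTE) != 0;
    int r = (characteristics & SCN_MEM_READ) != 0;
    int w = (characteristics & SCN_MEM_WRITE) != 0;
    return table[x][r][w];
}
```

A typical .text section (executable and readable) ends up as execute-read, and a typical .data section (readable and writable) as read-write, so no page needs to be writable and executable at the same time.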


Stealth Step 7

Flush the instruction cache and check if the TLS has entries to copy...

The code above flushes the instruction cache using NtFlushInstructionCache after processing the IAT. Next, the code checks if the PE file has a TLS (Thread Local Storage) directory entry in its optional header, using the IMAGE_DIRECTORY_ENTRY_TLS constant. If it does, it retrieves the TLS directory from the PE file, which contains a list of TLS callback functions (AddressOfCallBacks) to be executed during thread initialization.


Stealth Step 8

Finally, the loader transfers control to the executable's entry point...

The above code calculates the address of the DLLEntry function within the DLL and calls it with the appropriate arguments, which initializes the DLL. It then retrieves and casts function pointers for the NtClose and NtFreeVirtualMemory functions from the ntdll module, and calls them to close the open handle and release the virtual memory that is no longer needed.


Stealth Final

With all the code together, including all the new function definitions and hashes, it should look like the following.

structs.h


defs.h


peb.h


main.c


CMakeLists.txt


Once you have rewritten all the above code, you should be able to build and execute it to get the message box for verification that it worked.


Now that we know the logic works, let's take a closer look at the binary, starting first with Process Hacker as seen below.

As you can tell, there is no RWX region of memory anymore, which is exactly what we wanted to achieve by updating each section to the correct permissions as seen in Step 6 above.


Next, let's open the binary with CFF Explorer to check the import table now.

Although the import table appears improved, it has not yet reached our desired level. Before proceeding to write our sRDI, we must ensure that the loader is entirely position independent.

To accomplish this, we can remove the msvcrt.dll dependency altogether by modifying the CMakeLists.txt to include the following lines.

The above additions to the CMakeLists.txt file will remove the C standard library, and change the entry point from main() to start(). However, removing the msvcrt.dll dependency requires us to write our own strlen and memcpy functions to replace those used in our code.


At the top of the main.c file, we are going to add the two replacement functions for strlen and memcpy, as seen in the following code.

Search and replace every memcpy with a mc function, and replace every strlen with a sl function.
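If you want something to compare against, here is a minimal sketch of what such replacements can look like. The names sl and mc come from this post, but the signatures (mirroring strlen and memcpy) and the bodies are my assumption of a straightforward implementation.

```c
#include <stddef.h>

/* Minimal strlen replacement: counts bytes up to the NUL terminator. */
size_t sl(const char *s) {
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

/* Minimal memcpy replacement: byte-wise copy, regions must not overlap. */
void *mc(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```

Both functions only touch their arguments and use no global state, which keeps them compatible with the position-independent goal of this section.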

With those changes done, rebuild and run the binary to make sure it works. After verifying you get the message box, try re-opening the file within CFF Explorer again.

Bingo! No imports are required to run, and the message box still pops, showing that the DLL is being loaded using fully position independent code. Now that we have a fully PIC RDI loader, let's take a look at how we can turn this into an sRDI loader so that we don't have to rely on an on-disk/network DLL and can load the DLL straight from a stub.


Stealth sRDI Loader

Now that we have a fully position independent RDI loader, we can work on changing the first few steps in start() to achieve the loading of shellcode from a stub instead of opening and reading bytes from a file.

First thing we need to do is turn our DLL file into shellcode, and the pe_to_shellcode tool written by Hasherezade is perfect for this PoC.

GitHub - hasherezade/pe_to_shellcode: Converts PE into a shellcode

After turning your dll_poc.dll into shellcode dll_poc.bin, you can use the xxd tool to make the header file for you to use.

xxd -i dll_poc.bin > dll.h

Then you simply add the dll.h header file to the src/h/ folder with the others.

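The header file only includes two variables, dll_bin and dll_bin_len, which store the DLL bytes and their size respectively. Note that xxd derives the variable names from the input file name, so running it on dll_poc.bin produces dll_poc_bin and dll_poc_bin_len; rename them in dll.h (or name the input file dll.bin) to get the names used here.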


With our new dll.h header file, we can go back to start() and change Step 1 to use the shellcode instead of the on-disk DLL file. Using the dll.h header file, our Step 1 goes from 35 lines of code, to a couple as seen below.

Replace Step 1

In this step, the binary representation of the DLL file is read from header stub...

To avoid printing all the other steps since the code is almost identical, I am going to print the new main.c showing that we removed 33 lines of code from Step 1 when using sRDI over RDI injection. There are also minor changes, but they should be obvious when looking at the following code.

Then once we rebuild and execute the loader, we can see that our sRDI loader was able to load the DLL from a header file instead of pulling a DLL from disk or over the network.


And with that, I am done showing you code examples. However, this code still has a lot of optimizations that could be done to make it more stealthy and evasive, like encrypting the DLL bytes in the header file, and decrypting before execution... or adding some sleep obfuscation to the DLL itself, etc...

I hope this paper gives you a decent understanding of how you can replace standard Windows API calls with their Native API equivalents, how to use function obfuscation and function name hashing, and how to reflectively load a DLL into memory.