_aLfa_ Site Admin
Joined: 21 Sep 2002 Posts: 233 Location: Aveiro, Portugal
Posted: Wed Sep 08, 2004 11:26 am
Post subject: VB: Wearing the Inside Out
Richard Marko
ESET Software, Slovakia
With a user base of around five million, Visual Basic (VB) is probably the most widely used programming language in the world. Though not very suitable for 'hard-core' professionals, VB makes programming very simple and this is why it is so popular with beginners. Not surprisingly, we see a lot of worms, backdoors and other kinds of malware written in Visual Basic.
VB versions 5.0 and 6.0 can create two types of executable: P-Code and native code (both require the VB runtime DLL to execute). The majority of projects written in VB (including malware) are compiled to native code simply because it is the default option. However, it is very simple to change it to P-Code, so there is good reason to study the internals of such executables. Understanding P-Code will also help us to gain a better understanding of native code.
An article by Andy Nikishin and Mike Pavlyushchik in the January 2002 issue of Virus Bulletin (see VB, January 2002, p.6) presents a good introduction to the internal structures of VB executables, especially those compiled to native code. In this article, executables compiled to P-Code will be examined more closely.
P-Code Basics
P-Code, or pseudo code, is a set of stack-oriented, CPU-independent instructions: an intermediate step between the high-level statements of a Basic program and the low-level native code instructions executed by a computer's processor. It can either be interpreted by the VB virtual machine (when the project is compiled to P-Code) or converted to native code and optimized (when the project is compiled to native code). It has instructions for loading, storing, initializing and object method calling, as well as many instructions for arithmetic and logical operations, control flow and so on.
To avoid any misunderstandings in terminology, it should be noted that the P-Code generated by Visual Basic for Applications (VBA) differs from that generated by VB. While the P-Code of VBA is basically the pre-processed source code stored in a special ready-to-interpret form, the P-Code of VB is the result of true compilation. It is, therefore, comparable to MSIL (Microsoft Intermediate Language) used by Microsoft's .NET Framework.
P-Code Internals
The internal structures of VB executables are not documented. Fortunately, however, debug information for the VB runtime DLL is available. It tells us the names of the P-Code instructions and generally makes the analysis easier.
To see how things really work we will first examine a simple 'Hello World!' project with one module and the following source code:
[vb]
Sub Main()
    MsgBox "Hello World!"
End Sub
[/vb]
We set 'Compile to P-Code' in the project properties and compile it. After the PE executable is built we find a well-known stub at the entry point:
[asm]
00401038 push offset EXEPROJECTINFO
0040103D call MSVBVM60:ThunRTMain
[/asm]
The ThunRTMain function exported by the VB runtime DLL reads the EXEPROJECTINFO structure and starts VB project initialization. The structure contains a lot of important information, such as the address of the Main() function. The VB runtime will eventually call this address (if it is defined). Examining the code there we see:
[asm]
004011CC mov edx,00040175C
004011D1 mov ecx,000401032
004011D6 jmp ecx
...
00401032 jmp MSVBVM60:ProcCallEngine
[/asm]
The code loads the address of a structure into edx and jumps to the P-Code interpretation engine (ProcCallEngine) in the VB runtime library. The structure in edx (let us call it ProcDescription) contains all the important information about the Main() function, including the address of the corresponding module description structure, the size of local variables and the size of all P-Code instructions for this function.
The ProcCallEngine loads various internal data, sets the error handler, allocates sufficient space on the stack and then starts processing the P-Code instructions:
[asm]
ProcCallEngine:
...
66104abe mov ebx,edx
...
66104b7f movzx esi,word ptr [ebx+08] ; P-Code size
66104b83 neg esi
66104b85 add esi,ebx
...
66104b8d xor eax,eax
66104b8f mov al,[esi]
66104b91 inc esi
66104b92 jmp dword ptr [tblDispatch+eax*4]
[/asm]
From the code above we see the P-Code instructions precede the ProcDescription structure directly. The first byte of every instruction encodes the instruction type. This leads to 256 different encodings, though not all of these are used. Furthermore, opcodes 0xfb - 0xff, named Lead0 - Lead4, have a second opcode byte and thus constitute a new set of 256 encodings each. Consequently, there are hundreds of different P-Code instructions. The opcode bytes may be followed by one or more arguments. The majority of the instructions have a fixed number of arguments but there are a few exceptions.
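The two observations above (the P-Code sits immediately before ProcDescription, and opcodes 0xfb - 0xff introduce a second opcode byte) can be sketched in a few lines of Python. This is only an illustration, not a full disassembler: the function names are hypothetical, offsets are plain offsets into a byte buffer rather than virtual addresses, and argument decoding is not attempted.

```python
import struct

LEAD_FIRST = 0xFB  # opcodes 0xFB..0xFF are Lead0..Lead4, as stated in the article


def pcode_start(image: bytes, proc_desc_off: int) -> int:
    """Return the buffer offset of the first P-Code byte of a procedure.

    Mirrors the ProcCallEngine fragment: a 16-bit P-Code size is read from
    ProcDescription+8 and subtracted from the address of the structure.
    """
    (size,) = struct.unpack_from("<H", image, proc_desc_off + 8)
    return proc_desc_off - size


def read_opcode(code: bytes, pos: int):
    """Decode one opcode at pos (arguments are NOT decoded here).

    Returns (table, opcode, next_pos). table is None for the primary
    dispatch table, or 0..4 for the Lead0..Lead4 extension tables.
    """
    first = code[pos]
    if first >= LEAD_FIRST:
        return first - LEAD_FIRST, code[pos + 1], pos + 2
    return None, first, pos + 1


# Synthetic buffer: three P-Code bytes followed by a dummy ProcDescription
# whose size field (at +8) says that 3 bytes of P-Code precede it.
image = bytes.fromhex("FE4E14") + bytes(8) + struct.pack("<H", 3)
start = pcode_start(image, proc_desc_off=3)
print(start, read_opcode(image, start))
```

In a real executable, proc_desc_off would first have to be translated from the virtual address loaded into edx to a file offset using the PE section table.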
Knowing all that, we can inspect the P-Code for our example Main() function itself:
Code: | 0000 27FCFE LitVar_Missing loc_0104 ; Context
0003 271CFF LitVar_Missing loc_00E4 ; HelpFile
0006 273CFF LitVar_Missing loc_00C4 ; Title
0009 F500000000 LitI4 00000000 ; Buttons
000E 3A6CFF0000 LitVarStr loc_0094, "Hello World!"
0013 4E5CFF FStVarCopyObj loc_00A4
0016 045CFF FLdRfVar loc_00A4 ; Prompt
0019 0A01001400 ImpAdCallFPR4 rtcMsgBox,14
001E 3608005CFF3CFF1C FFreeVar loc_00A4,loc_00C4,loc_00E4,loc_0104
0029 14 ExitProc
|
Let us explain this code briefly. We already mentioned that the VB P-Code instructions are stack-oriented. The operation of an instruction can be derived from its name, but often it is necessary to inspect the actual code in the runtime DLL that interprets it.
The name of the first instruction, LitVar_Missing, can be divided into three parts. 'Lit' states that a value will be pushed onto the top of the stack. 'Var', which stands for 'Variant', not 'variable', tells us that the instruction works with a variable of the 'Variant' type (Variant is a special data type that can contain any standard kind of data). Finally, 'Missing' specifies that the variable will be loaded with a special DISP_E_PARAMNOTFOUND value.
The instruction takes one argument, which is the offset of a local variable from the top of the stack. It fills the variable with the value and pushes its address onto the stack (the Variant type occupies more than four bytes in memory and therefore is not pushed directly onto the stack; instead a reference to it is pushed). LitI4 takes an I4 (Long) argument and, as expected, pushes it onto the stack as well.
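Comparing the raw bytes with the operand names in the listing suggests that the local variable argument is a signed 16-bit little-endian value, and that the loc_XXXX label is simply its magnitude: 27 FC FE gives 0xFEFC, i.e. -0x104, shown as loc_0104. A minimal sketch under that assumption (the function name is hypothetical, and the naming rule is inferred from the listing, not documented):

```python
import struct


def local_name(arg: bytes) -> str:
    """Turn a 2-byte local-variable argument into a loc_XXXX label.

    Assumption inferred from the listing: the argument is a signed
    16-bit little-endian offset, negative for local variables.
    """
    (offset,) = struct.unpack("<h", arg)
    if offset >= 0:
        raise ValueError("only negative (local variable) offsets are handled")
    return f"loc_{-offset:04X}"


print(local_name(bytes.fromhex("FCFE")))  # argument of the first LitVar_Missing
```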
The second argument of the LitVarStr instruction is an index to a special table (ReferenceTable) which contains the addresses of various data such as strings, imported functions, COM objects and so on.
Every module in a VB project has its own ReferenceTable. In the case of LitVarStr the corresponding address in this table points to a string in BSTR format (a zero-terminated Unicode string preceded by its length). The string is contained in the read-only '.text' section of the PE executable, so a copy of it is created first by FStVarCopyObj. A reference to the copy is then pushed onto the stack by the FLdRfVar instruction as the first (Prompt) parameter of the MsgBox() function. We did not specify the remaining four parameters (Buttons, Title, HelpFile and Context), so they were generated automatically. The last three are loaded by LitVar_Missing and the Buttons parameter is vbOKOnly (defined as 0) by default.
The ImpAdCallFPR4 instruction takes two arguments. The first is an index to ReferenceTable, the second is the size of the parameters on the stack. The instruction takes an address from ReferenceTable and calls it. The code at this address follows:
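For readers who want to pull such strings out of an executable, the BSTR layout can be sketched as follows. A BSTR pointer addresses the first character; the four bytes before it hold the length in bytes, excluding the terminator. The helper names are hypothetical and the offsets are offsets into a byte buffer, not virtual addresses.

```python
import struct


def read_bstr(buf: bytes, ptr: int) -> str:
    """Read a BSTR whose first character is at buf[ptr].

    The 32-bit length prefix sits immediately before the character data
    and counts bytes, not characters; the data is UTF-16LE.
    """
    (nbytes,) = struct.unpack_from("<I", buf, ptr - 4)
    return buf[ptr:ptr + nbytes].decode("utf-16-le")


def make_bstr(text: str) -> bytes:
    """Build the in-memory form of a BSTR (used here only for the demo)."""
    data = text.encode("utf-16-le")
    return struct.pack("<I", len(data)) + data + b"\x00\x00"


blob = make_bstr("Hello World!")
print(read_bstr(blob, 4))
```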
[asm]00401020 jmp MSVBVM60:rtcMsgBox[/asm]
As can be seen, the code simply jumps to the routine statically imported from the VB virtual machine DLL. After rtcMsgBox returns, the FFreeVar instruction checks all the Variant variables used and frees any allocated resources (in our example the copy of the 'Hello World!' string). Finally, ExitProc does all the necessary de-initializations, removes the error handler, restores the stack (removes all possible arguments) and returns to the caller of ProcCallEngine.
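FFreeVar is also a convenient example of an instruction with a variable number of arguments. Judging from the listing, the opcode byte 0x36 is followed by a 16-bit byte count (0008) and then by that many bytes of 16-bit local variable offsets. The sketch below assumes exactly that layout. Note that the listing truncates the instruction bytes after 1C; the last three bytes used here (FF FC FE) are reconstructed from the operands loc_00E4 and loc_0104 and are an assumption, as is the function name.

```python
import struct


def decode_ffreevar(code: bytes, pos: int):
    """Decode an FFreeVar instruction at pos (layout inferred from the listing).

    Returns (list of loc_XXXX labels, offset of the next instruction).
    """
    if code[pos] != 0x36:
        raise ValueError("not an FFreeVar opcode")
    (nbytes,) = struct.unpack_from("<H", code, pos + 1)
    offsets = struct.unpack_from(f"<{nbytes // 2}h", code, pos + 3)
    return [f"loc_{-off:04X}" for off in offsets], pos + 3 + nbytes


# 36 0800 5CFF 3CFF 1CFF FCFE  (last three bytes reconstructed, see above)
insn = bytes.fromhex("3608005CFF3CFF1CFFFCFE")
print(decode_ffreevar(insn, 0))
```

The decoded length of 11 bytes agrees with the listing, where FFreeVar starts at 001E and ExitProc follows at 0029.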
The example above has shown that, from the VB runtime point of view, there is not a big difference between a project compiled to native code and P-Code. It can call P-Code functions just like native code and the small chunks of code do all the necessary work to ensure the P-Code will be interpreted correctly.
This example was very simple. Most VB projects have no Main() function; instead they have a starting form. Finding the entry points of the starting form events (for example 'Load()') is more complicated and requires knowledge of several data structures which can be traced from the EXEPROJECTINFO structure.
Imported Functions
Applications written in VB are able to call external procedures in DLLs. These can be imported either statically (at load time via the PE import section) or dynamically (at run time). Usually the procedures of the VB runtime library are imported statically. This includes procedures like ThunRTMain or ProcCallEngine, which are referenced automatically by the VB compiler, as well as a special set of functions available to the VB programmer. They can be used in code without explicit declaration.
The names of these functions have the 'rtc' prefix (the prefix is added by the compiler, so it does not appear in the source code). Some of them are interesting in terms of generic malware detection - rtcCreateObject, rtcFileCopy and so on. The way these functions are called was illustrated in the previous example.
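As a rough illustration of how such imports could feed a heuristic, the sketch below filters a list of import names for the two functions mentioned above. The watch list, the function name and the idea of treating a numbered variant such as rtcCreateObject2 as equivalent are assumptions made for this example; obtaining the import names from the PE file is left to whatever PE parser is at hand.

```python
# Watch list limited to the two examples named in the article.
SUSPICIOUS_RTC = {"rtcCreateObject", "rtcFileCopy"}


def flag_rtc_imports(import_names):
    """Return the imports that match the watch list, sorted by name.

    Assumption: a trailing digit (e.g. rtcCreateObject2) denotes a variant
    of the same runtime function, so it is stripped before matching.
    """
    return sorted(
        name for name in import_names
        if name.rstrip("0123456789") in SUSPICIOUS_RTC
    )


imports = ["ThunRTMain", "rtcMsgBox", "rtcFileCopy", "rtcCreateObject2"]
print(flag_rtc_imports(imports))
```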
Dynamic Imports
Of more interest are calls to external API functions declared using the 'Declare' statement. The information about such functions is stored in a special table (ImportTable). Each entry contains the name of the function and DLL together with a short piece of native code that invokes the function. The following example illustrates the use of external functions. Similar code can be found in many backdoors:
[vb]
Public Declare Function GetCurrentProcessId _
    Lib "kernel32" () As Long
Public Declare Function RegisterServiceProcess _
    Lib "kernel32" (ByVal dwProcessID As Long, _
    ByVal dwType As Long) As Long

Public Sub MakeMeService()
    Dim pid As Long
    Dim regserv As Long
    On Error Resume Next
    pid = GetCurrentProcessId()
    regserv = RegisterServiceProcess(pid, 1)
End Sub
[/vb]
This source code compiles to:
Code: | 0000 0002 LargeBos 0002
0002 0005 LargeBos 0007
0004 4BFFFF OnErrorGoto Resume Next
0007 0011 LargeBos 0018
0009 5E00000000 ImpAdCallI4 kernel32:GetCurrentProcessId,00
000E 7170FF FStI4 loc_0090
0011 3C SetLastSystemError
0012 6C70FF ILdRf loc_0090
0015 7178FF FStI4 loc_0088
0018 0019 LargeBos 0031
001A F501000000 LitI4 00000001
001F 6C78FF ILdRf loc_0088
0022 5E01000800 ImpAdCallI4 kernel32:RegisterServiceProcess,08
0027 7170FF FStI4 loc_0090
002A 3C SetLastSystemError
002B 6C70FF ILdRf loc_0090
002E 7174FF FStI4 loc_008C
0031 0000 LargeBos 0031
0033 14 ExitProc
|
Using the 'On Error Resume Next' statement is a little trick. It forces the VB compiler to emit a LargeBos instruction in front of the code for every statement, so we can easily see how a particular statement is compiled.
Here loc_0088 is the pid variable and loc_008C is regserv. The way both API functions are called is very similar to the way the rtcMsgBox function was called in the previous example (although both functions now return a Long value, which is reflected in the mnemonic of the ImpAdCallI4 instruction). However, the associated code (remember its address is in ReferenceTable) is different:
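The listing also hints at how LargeBos is encoded: the opcode byte 00 is followed by a single byte which, added to the offset of the LargeBos instruction itself, gives the offset of the next statement boundary (0007 + 11h = 0018). This is an inference from the listing rather than documented behaviour, and the function name below is hypothetical.

```python
def largebos_target(offset: int, insn: bytes) -> int:
    """Return the offset of the next statement boundary.

    Assumed encoding (inferred from the listing): opcode 0x00 followed by
    a one-byte distance measured from the start of the instruction.
    """
    if insn[0] != 0x00:
        raise ValueError("not a LargeBos opcode")
    return offset + insn[1]


# (offset, raw bytes) pairs copied from the listing above
for off, raw in [(0x0002, "0005"), (0x0007, "0011"), (0x0018, "0019")]:
    print(f"{off:04X} -> {largebos_target(off, bytes.fromhex(raw)):04X}")
```

Following these targets from the start of a procedure splits its P-Code into one chunk per source statement.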
[asm]
0040134C mov eax,[004022E4]
00401351 or eax,eax
00401353 je 00401357
00401355 jmp eax
00401357 push 00401334
0040135C mov eax,00401020
00401361 call eax
00401363 jmp eax
...
00401020 jmp MSVBVM60:DllFunctionCall
[/asm]
First the code checks whether the address of the imported function has already been resolved. If it has, the code simply jumps there; otherwise the DllFunctionCall function is invoked with a pointer to the ImportTable entry as an argument. DllFunctionCall reads the names of the imported function and its DLL from there and tries to get the address using the LoadLibraryA and GetProcAddress APIs. If successful, it saves this address (in the variable at 004022E4 in this case) and returns it. If not, it invokes the error handler internally.
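The control flow of this stub is ordinary lazy binding with a cached result. The Python model below captures only that logic; ImportEntry, call_thunk and the resolver callback are hypothetical names, and the resolver is a stand-in for what DllFunctionCall does with LoadLibraryA and GetProcAddress.

```python
class ImportEntry:
    """Model of one ImportTable entry plus its cached address slot."""

    def __init__(self, dll: str, func: str):
        self.dll = dll
        self.func = func
        self.address = None  # plays the role of the variable at 004022E4


def call_thunk(entry: ImportEntry, resolve):
    """Return the target address, resolving it only on first use."""
    if entry.address is None:            # 'or eax,eax / je' in the stub
        entry.address = resolve(entry.dll, entry.func)
    return entry.address                 # 'jmp eax' in the stub


calls = []


def fake_resolver(dll, func):
    """Stand-in resolver that records how often it is asked."""
    calls.append((dll, func))
    return 0x7C800000 + len(calls)       # arbitrary made-up address


entry = ImportEntry("kernel32", "GetCurrentProcessId")
first = call_thunk(entry, fake_resolver)
second = call_thunk(entry, fake_resolver)
print(hex(first), first == second, len(calls))
```

For static analysis the practical consequence is that the API names are not in the PE import section; they have to be read from the ImportTable entries.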
COM Technology
One of the reasons why VB is so popular must be the fact that it makes programming with COM objects so simple. Often a program that would take a lot of coding (and perhaps cause a lot of headaches) in languages like C++ can be written in a few lines of VB. Of course we pay a price for the comfort but sometimes it is very elegant.
In contrast to the VBA or VBScript members of the Visual Basic family, VB supports late binding as well as both types of early binding - vtable and DispID bindings.
Late Binding
While there is usually no need to use costly late binding in VB, the source code of the infamous Melissa (VBA) and LoveLetter (VBScript) worms is so popular that the majority of VB email worms we see use this kind of binding. Late binding means that no type library information about the COM object used is available during compilation, so it must be collected at run time.
The following example shows how a few lines of VB code are compiled to P-Code. They use the MS Outlook COM object to create an email. As we have selected no additional references in the VB project settings, nor have we explicitly defined the type of the object variables, the compiler has no option other than to use late binding. The source code:
[vb]
...
Set out = CreateObject("Outlook.Application")
Set mail = out.CreateItem(0)
mail.Recipients.Add "marko@eset.sk"
...
[/vb]
The corresponding P-Code:
Code: | ...
0007 001A LargeBos 0021
0009 F500000000 LitI4 00000000
000E 1B0000 LitStr "Outlook.Application"
0011 046CFF FLdRfVar loc_0094
0014 0A01000C00 ImpAdCallFPR4 rtcCreateObject2,0C
0019 046CFF FLdRfVar loc_0094
001C 045CFF FLdRfVar loc_00A4
001F FE4E SetVarVarFunc
0021 0018 LargeBos 0039
0023 284CFF0000 LitVarI2 loc_00B4,0000
0028 25 PopAdLdVar
0029 045CFF FLdRfVar loc_00A4
002C FF426CFF02000100 VarLateMemCallLdVar loc_0094,CreateItem,1
0034 042CFF FLdRfVar loc_00D4
0037 FE4E SetVarVarFunc
0039 001C LargeBos 0055
003B 3A4CFF0300 LitVarStr loc_00B4,"marko@eset.sk"
0040 25 PopAdLdVar
0041 042CFF FLdRfVar loc_00D4
0044 FF3D1CFF0400 VarLateMemLdRfVar loc_00E4,Recipients
004A FD9F LdPrVar
004C FE9805000100 LateMemCall Add,1
0052 351CFF FFree1Var loc_00E4
...
|
All the communication with the COM object is realized via the IDispatch interface. The rtcCreateObject2 function converts input ProgID to CLSID, creates an instance of the specified class and requests a pointer to the IDispatch interface.
The methods and properties of this interface are then invoked by 'LateMem' instructions (i.e. their names contain the 'LateMem' string). All take a method/property name as an argument, convert it internally to a DispID using IDispatch::GetIDsOfNames and then invoke the method/property using IDispatch::Invoke.
There are several 'LateMem' instructions. Methods are invoked by the 'LateMemCall' instructions, 'LateMemLd' instructions read properties and 'LateMemSt' write properties. If the instruction name is preceded by 'Var', the instruction takes the pointer to the class instance from the reference at the top of the stack. Otherwise, the pointer is read from a special internal variable set by the LdPrVar instruction (actually there are more 'Pr' instructions). On the other hand, if 'Var' is appended to the name, the instruction stores a result value in the specified variable. The 'LateMemCall' instructions also receive the number of arguments for the invoked method while the arguments themselves are prepared on the stack (PopAdLdVar).
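The naming rules just described are regular enough to be expressed as a small classifier. The sketch below encodes only what the paragraph states; the function name and the wording of the returned fields are hypothetical, and mnemonics that do not follow the pattern are rejected rather than guessed at.

```python
def describe_latemem(mnemonic: str) -> dict:
    """Classify a 'LateMem' mnemonic using the naming rules in the text."""
    if "LateMem" not in mnemonic:
        raise ValueError("not a LateMem instruction")
    head, tail = mnemonic.split("LateMem", 1)
    if tail.startswith("Call"):
        action = "invoke method"
    elif tail.startswith("Ld"):
        action = "read property"
    elif tail.startswith("St"):
        action = "write property"
    else:
        raise ValueError("unrecognised LateMem variant")
    return {
        "action": action,
        # 'Var' in front: instance pointer comes from the reference on the stack;
        # otherwise it comes from the internal variable set by LdPrVar.
        "instance_from": "stack" if head == "Var" else "internal variable",
        # 'Var' at the end: the result is stored in the specified variable.
        "stores_result": tail.endswith("Var"),
    }


for name in ("VarLateMemCallLdVar", "VarLateMemLdRfVar", "LateMemCall"):
    print(name, describe_latemem(name))
```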
VTable Binding
In the fastest form of early binding, vtable binding, Visual Basic uses an offset into a virtual function table (vtable). It needs to know the layout of the vtable, the IIDs of the interfaces used and so on, so appropriate references need to be selected in the project properties. The following code can be found in many worms, backdoors and other malware:
[vb]PathName = App.Path & "\" & App.EXEName & ".exe"[/vb]
While the source code is pretty simple, the compiled code is a little more complicated and lengthy:
Code: | 0000 0474FF FLdRfVar loc_008C
0003 0478FF FLdRfVar loc_0088
0006 050000 ImpAdLdRf Global:VBGlobal
0009 240100 NewIfNullPr Global:VBGlobal
000C 0D14000200 VCallHresult VBGlobal:App
0011 0878FF FLdPr loc_0088
0014 0D50000300 VCallHresult _App:Path
0019 6C74FF ILdRf loc_008C
001C 1B0400 LitStr "\"
001F 2A ConcatStr
0020 2368FF FStStrNoPop loc_0098
0023 046CFF FLdRfVar loc_0094
0026 0470FF FLdRfVar loc_0090
0029 050000 ImpAdLdRf Global:VBGlobal
002C 240100 NewIfNullPr Global:VBGlobal
002F 0D14000200 VCallHresult VBGlobal:App
0034 0870FF FLdPr loc_0090
0037 0D58000300 VCallHresult _App:EXEName
003C 6C6CFF ILdRf loc_0094
003F 2A ConcatStr
0040 2364FF FStStrNoPop loc_009C
0043 1B0500 LitStr ".exe"
0046 2A ConcatStr
0047 4644FF CVarStr
004A FCF654FF FStVar
...
|
In vtable binding, methods and properties are invoked by the 'VCall' instructions. All these instructions take an offset into the vtable as the first argument. They take the pointer to the class instance from the internal variable (see previous section) and invoke the method/property at the specified offset in the vtable.
Since the return value is usually HRESULT the VCallHresult instruction is used most often. This is lucky indeed because this particular instruction takes a second argument. It is an index to ReferenceTable and points to the corresponding interface IID. The instruction verifies the returned HRESULT status internally and if it fails it invokes an error handler. The IID is supplied as an argument to the error handler. With both the vtable offset and the interface IID, it is relatively easy to look up a method/property name.
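A lookup of this kind might be sketched as follows. The instruction layout (opcode 0x0D, a 16-bit vtable offset, a 16-bit ReferenceTable index) is inferred from the listing above. The table is keyed by interface name only because the actual IIDs are not given in this article; a real tool would key on the IID read from ReferenceTable and fill the table from a type library. All function and table names are hypothetical.

```python
import struct

# (interface, vtable offset) -> member name, copied from the listing.
# Interface names stand in for the IIDs, which the article does not list.
KNOWN_MEMBERS = {
    ("VBGlobal", 0x14): "App",
    ("_App", 0x50): "Path",
    ("_App", 0x58): "EXEName",
}


def decode_vcallhresult(insn: bytes):
    """Split a VCallHresult instruction into (vtable offset, ReferenceTable index)."""
    if insn[0] != 0x0D:
        raise ValueError("not a VCallHresult opcode")
    return struct.unpack_from("<HH", insn, 1)


def member_name(interface: str, vt_offset: int) -> str:
    """Resolve a member name, falling back to the vtable slot number.

    The slot number assumes 4-byte vtable entries (32-bit code).
    """
    return KNOWN_MEMBERS.get((interface, vt_offset), f"slot_{vt_offset // 4}")


vt_offset, ref_index = decode_vcallhresult(bytes.fromhex("0D50000300"))
print(hex(vt_offset), ref_index, member_name("_App", vt_offset))
```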
DispID Binding
The slower form of early binding is more similar to late binding and is seldom used in VB applications. Automation COM components usually support dual interfaces (i.e. they support both vtable and DispID binding), and the VB compiler uses vtable binding whenever possible. There is a set of 'LateId' instructions which are analogous to the 'LateMem' instructions, but they take a DispID argument instead of a method/property name. There is probably no easy way to force VB to use DispID binding, but it is possible. For example, we modified the MS Outlook type library, changing some of the dual interfaces to pure dispinterfaces. However, since this is clearly a 'laboratory' result we will not discuss it here.
Conclusion
The study of the internals of Visual Basic executables is a rather complex issue. It is also a time-consuming endeavour, since the analysis requires extensive work with debuggers, disassemblers, hex browsers and similar tools. Visual Basic, being one of the most popular programming languages in the world, may well justify all the effort. The outline of our approach indicates that time-consuming and detailed analysis may, in fact, provide important clues to the VB application's interaction with the outside environment and prove to be instrumental in the development of efficient heuristic rules to identify malicious software generically. The future will show how successful we will be.
Date: June 2002
Source: Virus Bulletin