Decoding Ethereum Events

May 24, 2024 [ethereum] [smart-contracts]

Observability is an often overlooked part of software development and deployment. I myself am guilty of chucking in a handful of unstructured log messages and calling it a day. As an application grows more complex, this usually forces our hand, and we end up implementing either structured logging or perhaps a tracing system.

The standard way of observing smart contracts is through logs (also referred to as events). These are a bit more sophisticated than a standard print statement and have some additional properties. The most interesting of these is that events are provable: all the logs emitted during a block are Merkleized into the receipts root field of the block header, so we can prove that a certain event happened without needing to re-execute the block. Bridging protocols, for example, rely on this quite heavily.

Contracts can emit events through five different opcodes, LOG0 through LOG4. Each of these opcodes records a region of memory¹, and all except LOG0 also store topics, which are indexed 256-bit parameters² (LOGn stores n of them). Generally, the first topic denotes the event type: it is the Keccak-256 hash of the event's signature (known as the event selector). Languages also provide a way to declare “anonymous” events, which do not set the first topic to the selector.

The notion of topics may seem unimportant at first glance. After all, if we can emit a region of memory, we can simply log a string (which is how most standard applications handle things), but the Ethereum ecosystem gives topics special importance through certain ERCs.

The most common ERCs you might have heard of are ERC-20 and ERC-777, which define the interface (functions and events) for contracts that want to adhere to the standard. ERC-20 defines the following events:

event Transfer(address indexed _from, address indexed _to, uint256 _value) 
event Approval(address indexed _owner, address indexed _spender, uint256 _value)

It also specifies where, when, and how these events are to be emitted.

Given the above definition, I know I have a Transfer event when topic0 == 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef, and I should expect 2 more topics (the from and to addresses) and 1 word in memory (the value). Unfortunately, this isn’t always the case, as the algorithm that computes the event selector doesn’t account for the indexed flag on the parameters³, meaning that the following two calls return the same selector:

to_selector('Transfer(address indexed f, address indexed t, uint256 v)')
to_selector('Transfer(address indexed f, address indexed t, uint256 indexed v)')
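To see why the two selectors collide, here is a minimal sketch of the canonicalization step that happens before hashing. The helper name canonical_signature is hypothetical, and it only handles flat parameter lists (no tuples). The Keccak-256 hashing step is left out on purpose: Python's standard hashlib.sha3_256 implements standardized SHA-3, which pads differently from the Keccak-256 variant Ethereum uses, so it would not reproduce topic0.

```python
def canonical_signature(declaration: str) -> str:
    """Reduce an event declaration to the `Name(type1,type2,...)` string that gets hashed.

    Hypothetical helper: supports flat parameter lists only (no tuple types).
    """
    name, _, rest = declaration.partition("(")
    args = rest.rsplit(")", 1)[0]
    # The first token of each parameter is its type; `indexed` and the name are dropped.
    types = [param.split()[0] for param in args.split(",") if param.strip()]
    # `name.split()[-1]` also discards a leading `event` keyword, if present.
    return f"{name.split()[-1]}({','.join(types)})"


a = canonical_signature("Transfer(address indexed f, address indexed t, uint256 v)")
b = canonical_signature("Transfer(address indexed f, address indexed t, uint256 indexed v)")
print(a)       # Transfer(address,address,uint256)
print(a == b)  # True
```

Both declarations reduce to the same string, so they necessarily hash to the same selector.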

This might not seem like a big issue, since you might only care about fully compliant tokens anyway. However, when developing our internal transaction debugger, I wanted to make it robust enough to decode as many events as possible.

An initial solution to this problem might try brute-forcing index combinations. For the Transfer event this is easy, since there are only $^3C_2 = 3$ possible choices if we have 2 indexed arguments, and even if we move up to a larger event with $^6C_3$ it’s only 20 (5 bits is definitely not post-quantum). However, this doesn’t provide us with the most useful data in the average case, since elements could be decoded out of order.
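The brute-force enumeration can be sketched in a few lines. This is only an illustration, not the debugger's actual code: candidate_masks is a hypothetical name, and the number of indexed arguments is assumed to be known from the log's topic count (minus one for topic0 when the event is not anonymous).

```python
from itertools import combinations
from math import comb


def candidate_masks(n_params: int, n_indexed: int) -> list[tuple[bool, ...]]:
    """Enumerate every way of marking `n_indexed` of `n_params` arguments as indexed."""
    masks = []
    for chosen in combinations(range(n_params), n_indexed):
        masks.append(tuple(i in chosen for i in range(n_params)))
    return masks


masks = candidate_masks(3, 2)
for mask in masks:
    print(mask)
# (True, True, False)
# (True, False, True)
# (False, True, True)

print(len(candidate_masks(6, 3)), comb(6, 3))  # 20 20
```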

Take the following log:

topic0: 0xddf2...3b3ef
topic1: 0xd8da...6045
memory: 0x0000...10f5...d2fe...2b5e3af16b1880000

We could decode it as:

Transfer(0xd8da...6045, 0x10f5...d2fe, 0x2b5e3af16b1880000)

or equally as:

Transfer(0x10f5...d2fe, 0xd8da...6045, 0x2b5e3af16b1880000)

This is because Transfer(indexed,_,_) and Transfer(_, indexed, _) are indistinguishable, as the types of arguments 1 and 2 are the same. To determine the correct decoding, you have to look at the source code, and one of the goals when developing the debugger was to enable debugging of source-less contracts.
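The ambiguity is easy to reproduce. The sketch below assumes every argument is a static type occupying exactly one 32-byte word, and it uses small placeholder integers (ADDR_A, ADDR_B, VALUE) rather than the elided values from the log above; decode_words is a hypothetical helper.

```python
def decode_words(mask, topics, data_words):
    """Assign words to argument positions: indexed slots read from topics, the rest from data."""
    topic_iter, data_iter = iter(topics), iter(data_words)
    decoded = []
    for indexed in mask:
        decoded.append(next(topic_iter) if indexed else next(data_iter))
    return tuple(decoded)


# Placeholder values, not taken from a real log.
ADDR_A, ADDR_B, VALUE = 0xAAAA, 0xBBBB, 50
topics = [ADDR_A]             # topic1 only; topic0 (the selector) is already stripped
data_words = [ADDR_B, VALUE]  # the un-indexed words, in declaration order

first = decode_words((True, False, False), topics, data_words)
second = decode_words((False, True, False), topics, data_words)
print(first == (ADDR_A, ADDR_B, VALUE))   # True
print(second == (ADDR_B, ADDR_A, VALUE))  # True
```

Both masks consume exactly one topic and two data words, so neither decoding fails; only the argument order differs.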

To solve this, I aggregated a list of known ABIs and computed some statistics over them. Gathering the ABIs themselves was simple, as Etherscan provides an API to fetch verified ABIs. Computing the statistics was also quite easy, as modern versions of Postgres have good support for JSON objects.

create table events (
	selector bytea primary key,
	abi      jsonb
);

with abis as (
        select
            selector,
            (mode() within group (order by abi)) as _abi,
            index_combinations(jsonb_agg(abi)) as indexes
        from events
        group by selector
)
select
    selector,
    common.construct_signature(_abi) as signature,
    coalesce(abis.indexes, '[]'::jsonb) as indexes,
    abi.*
from abis, 
     jsonb_to_record(_abi) as abi("name" text, 
                                  "type" text, 
                                  "inputs" jsonb, 
                                  "anonymous" bool)

The above query will generate a table in which each row is a valid ABI entry with the most common argument names, plus two additional fields: selector, which is useful for lookups, and indexes, which is a list of the observed index combinations sorted from most to least popular.

The body of index_combinations, which receives the aggregated ABIs as jsonb_abi, is as follows:

with known_indexes as (
        -- returns a boolean array where each element represents if an argument is indexed
        select common.abi_indexed_event_mask(abi -> 'inputs') as indexes
        from jsonb_array_elements(jsonb_abi) as j(abi)
     ),
     perm_freq as (
        select indexes as index_perm, 
               count(*) as cnt
        from known_indexes
        where indexes is not null
        group by indexes
        order by cnt desc
     )
select jsonb_agg(index_perm order by cnt desc)
from perm_freq
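For readers less familiar with Postgres's JSON functions, the same aggregation can be sketched in Python. This is an approximation of the query above rather than a translation of the author's helpers: indexed_mask stands in for common.abi_indexed_event_mask, whose exact behavior is not shown in this post, and the ABI entries are assumed to follow the standard JSON ABI layout with an inputs list.

```python
from collections import Counter


def indexed_mask(inputs):
    """One boolean per argument: True when the argument is declared `indexed`."""
    return tuple(bool(arg.get("indexed", False)) for arg in inputs)


def index_combinations(abis):
    """Return the distinct indexed masks seen for one selector, most frequent first."""
    counts = Counter(
        indexed_mask(abi["inputs"]) for abi in abis if abi.get("inputs") is not None
    )
    return [list(mask) for mask, _count in counts.most_common()]


def transfer_abi(value_indexed):
    # Small constructor for illustrative ABI entries; not real aggregated data.
    return {
        "name": "Transfer",
        "type": "event",
        "inputs": [
            {"name": "_from", "type": "address", "indexed": True},
            {"name": "_to", "type": "address", "indexed": True},
            {"name": "_value", "type": "uint256", "indexed": value_indexed},
        ],
    }


sample = [transfer_abi(False), transfer_abi(True), transfer_abi(False)]
ranked = index_combinations(sample)
print(ranked)  # [[True, True, False], [True, True, True]]
```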

We can gather some statistics about this data to understand how often these different index combinations happen:

select count(*)                                    as cnt_total,
       sum((jsonb_array_length(indexes) > 1)::int) as cnt_multi_index_combination,
       avg(jsonb_array_length(indexes))            as avg_index_combinations,
       stddev(jsonb_array_length(indexes))         as stddev_index_combinations,
       max(jsonb_array_length(indexes))            as max_index_combinations
from aggregated_events;

cnt_total	cnt_multi	avg	stddev	max
260005	13446	1.043	0.334	8

As well as the distribution of these index combinations:

select len, count(*)
from (select jsonb_array_length(indexes) as len
      from aggregated_events) _
group by len
order by len;

len	0	1	2	3	4	5	6	7	8
count	5125	241434	11194	1746	346	109	37	12	2
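As a quick sanity check, the summary figures can be recomputed from this distribution. The snippet below only re-derives numbers already reported above: the first figure in the summary row (260005) is the sum over all buckets, and 13446 is the sum over the buckets with more than one combination.

```python
# Number of selectors (value) that have a given number of index combinations (key).
distribution = {0: 5125, 1: 241434, 2: 11194, 3: 1746, 4: 346, 5: 109, 6: 37, 7: 12, 8: 2}

total = sum(distribution.values())
multi = sum(count for length, count in distribution.items() if length > 1)
mean = sum(length * count for length, count in distribution.items()) / total

print(total)           # 260005
print(multi)           # 13446
print(round(mean, 3))  # 1.043
```

Put differently, roughly 5% of selectors (13446 of 260005) have more than one observed index combination.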

Circling back to the motivating Transfer example reveals the following index combinations:

index_perm	cnt
{true,true,false}	832171
{true,true,true}	163522
{false,false,false}	534
{true,false,false}	47
{false,true,false}	5

It seems that Transfer(indexed,_,_) just beats out Transfer(_,indexed,_) … maybe I should have just done it in-order.
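One plausible way to use this ranking at decode time, sketched under the assumption that a decoder simply picks the most frequent combination consistent with the number of topics in the log. The function best_mask is hypothetical and not necessarily what the debugger does; the counts are the ones from the table above.

```python
# (mask, count) pairs for Transfer, most frequent first, as reported above.
TRANSFER_MASKS = [
    ((True, True, False), 832171),
    ((True, True, True), 163522),
    ((False, False, False), 534),
    ((True, False, False), 47),
    ((False, True, False), 5),
]


def best_mask(ranked, n_topics, anonymous=False):
    """Pick the most frequent mask whose number of indexed arguments fits the log."""
    # For non-anonymous events, topic0 holds the selector and is not an argument.
    n_indexed = n_topics if anonymous else n_topics - 1
    for mask, _count in ranked:
        if sum(mask) == n_indexed:
            return mask
    return None  # no known ABI declares this many indexed arguments


print(best_mask(TRANSFER_MASKS, 3))  # (True, True, False)
print(best_mask(TRANSFER_MASKS, 2))  # (True, False, False)
```

For the ambiguous single-topic log from earlier, this picks Transfer(indexed,_,_), which happens to match the in-order guess.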

¹ Memory parameters are the remaining (un-indexed) log fields, ABI-encoded as a tuple.
² The indexed parameters can be used for efficient filtering via eth_getLogs.
³ This is also true for the return parameters of functions. The processing we’re doing here could be applied to get a best approximation for return data decoding; however, a better approach might use type information from decompilers such as Heimdall or Gigahorse.