Decoding Ethereum Events

May 24, 2024 [ethereum] [smart-contracts]

Smart contracts differ from regular programs in many ways, and logging is one of them. Logs in the Ethereum ecosystem aren’t just strings written to standard out; they are a fundamental part of the Ethereum Virtual Machine (EVM), embedded in the protocol. They let developers not only understand what’s happening with a contract, but also prove properties across networks thanks to receipt root proofs.

Contracts can emit events through five opcodes, LOG0 through LOG4. Each of these opcodes records a region of memory¹ and, with the exception of LOG0, also stores between one and four topics, which are indexed 256-bit parameters². By convention the first topic denotes the event type: it is the Keccak-256 hash of the event’s signature (known as the event selector). Languages also provide “anonymous” events, which do not set the first topic to the selector.

The notion of topics may seem unimportant (after all, if we can emit a region of memory we can log a string, which is how most standard applications do things), but the Ethereum ecosystem gives topics special importance through some ERCs.

The most common ERCs you might have heard of are ERC-20 and ERC-777, which define the interface (functions and events) for contracts which want to adhere to the standard. ERC-20 defines the following events:

event Transfer(address indexed _from, address indexed _to, uint256 _value) 
event Approval(address indexed _owner, address indexed _spender, uint256 _value)

and also specifies where, when, and how these events are to be emitted.

Given the above definition I know I have a Transfer event when topic0 == 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef, and I should expect 3 topics (the selector and 2 addresses) and 1 word in memory (the value). Unfortunately, this isn’t always the case: the algorithm that computes the selector of an event ignores which parameters are indexed³, so the following two calls produce the same selector.

to_selector('Transfer(address indexed f, address indexed t, uint256 v)')
to_selector('Transfer(address indexed f, address indexed t, uint256 indexed v)')
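To make the equivalence concrete, here is a minimal sketch of the canonicalisation step that precedes hashing. The function name `canonical_signature` is hypothetical, and the sketch only handles flat argument lists (no tuple arguments, no normalising of aliases such as `uint`). The selector itself is the Keccak-256 hash of the resulting string; that step is left out because Python's standard library does not ship Keccak-256 (`hashlib.sha3_256` uses different padding).

```python
import re

def canonical_signature(declaration: str) -> str:
    """Reduce an event declaration to the string that gets hashed.

    Only the event name and the argument types survive; argument names
    and the `indexed` keyword are dropped.
    """
    match = re.fullmatch(r"\s*(?:event\s+)?(\w+)\s*\((.*)\)\s*;?\s*", declaration)
    if match is None:
        raise ValueError(f"not an event declaration: {declaration!r}")
    name, args = match.groups()
    # The first token of each argument is its type; the rest is `indexed` and/or a name.
    types = [arg.split()[0] for arg in args.split(",") if arg.strip()]
    return f"{name}({','.join(types)})"

plain = canonical_signature("Transfer(address indexed f, address indexed t, uint256 v)")
all_indexed = canonical_signature("Transfer(address indexed f, address indexed t, uint256 indexed v)")
print(plain)        # Transfer(address,address,uint256)
print(all_indexed)  # Transfer(address,address,uint256)
```

Both declarations collapse to the same string, so they necessarily share a selector.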

This might not seem like a big issue, since you might only care about fully compliant tokens anyway. However, when developing our internal transaction debugger, I wanted it to be resilient enough to decode as many events as possible.

An initial solution to this problem might try to brute force the index combinations. For the Transfer event this is easy, since it’s only $^3C_2 = 3$ possible choices if we have 2 indexed arguments, and even for a larger event with $^6C_3$ it’s only 20 (5 bits is definitely not post-quantum). However, this doesn’t give us the most useful data in the average case, since several combinations can fit the same log and arguments could end up decoded out of order.
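The brute-force enumeration can be sketched in a few lines. The helper name `index_masks` is made up for this example; each mask holds one boolean per argument, in the same spirit as the index masks used later in the post.

```python
from itertools import combinations

def index_masks(n_args: int, n_indexed: int) -> list[tuple[bool, ...]]:
    """All ways to mark `n_indexed` of `n_args` arguments as indexed."""
    return [
        tuple(i in chosen for i in range(n_args))
        for chosen in combinations(range(n_args), n_indexed)
    ]

for mask in index_masks(3, 2):
    print(mask)
print(len(index_masks(6, 3)))  # 20
```

The number of indexed arguments is known from the log itself (the topic count, minus one for the selector on non-anonymous events), which is why only a single binomial coefficient of candidates needs to be tried.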

Take the following log:

topic0: 0xddf2...3b3ef
topic1: 0xd8da...6045
memory: 0x0000...10f5...d2fe...2b5e3af16b1880000

We could decode it as:

Transfer(0xd8da...6045, 0x10f5...d2fe, 0x2b5e3af16b1880000)

or equally as:

Transfer(0x10f5...d2fe, 0xd8da...6045, 0x2b5e3af16b1880000)

This is because Transfer(indexed,_,_) and Transfer(_,indexed,_) are indistinguishable: arguments 1 and 2 have the same type, so both layouts produce one topic besides the selector and two words of data. The same ambiguity arises in general whenever two index combinations leave a log with the same shape. You can only determine which one it is if you have the source code, and one of the goals when developing our debugger was to enable decoding for source-less contracts.
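The ambiguity is easy to reproduce with a toy decoder. This is a sketch under simplifying assumptions: `decode_with_mask` is a hypothetical helper, every value is treated as an opaque 32-byte word (so only static types are covered), and the strings below are placeholders standing in for the elided values in the log above.

```python
def decode_with_mask(mask, topics, data_words):
    """Assign topic and data words to argument positions according to `mask`.

    `topics` excludes topic0 (the selector). No type-specific decoding is
    attempted; each value is just passed through.
    """
    if sum(mask) != len(topics) or len(mask) - sum(mask) != len(data_words):
        raise ValueError("mask does not fit this log")
    topic_iter, data_iter = iter(topics), iter(data_words)
    return tuple(next(topic_iter) if indexed else next(data_iter) for indexed in mask)

# Placeholder words, not real log contents.
topics = ["ADDR_A"]
data_words = ["ADDR_B", "VALUE"]

first = decode_with_mask((True, False, False), topics, data_words)
second = decode_with_mask((False, True, False), topics, data_words)
print(first)   # ('ADDR_A', 'ADDR_B', 'VALUE')
print(second)  # ('ADDR_B', 'ADDR_A', 'VALUE')
```

Both masks fit the log without error, and nothing in the log says which of the two results is the intended one.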

To solve this problem I decided to aggregate a list of known ABIs and then compute statistics over them. Gathering the ABIs themselves was simple, as Etherscan provides an API to fetch verified ABIs, and thanks to Postgres’s support for JSON objects, computing the statistics is a breeze.

create table events (
	selector bytea primary key,
	abi      jsonb
);

with abis as (
    select
        selector,
        (mode() within group (order by abi)) as _abi,
        index_combinations(jsonb_agg(abi)) as indexes
    from events
    group by selector
)
select
    selector,
    common.construct_signature(_abi) as signature,
    coalesce(abis.indexes, '[]'::jsonb) as indexes,
    abi.*
from abis,
     jsonb_to_record(_abi) as abi("name" text,
                                  "type" text,
                                  "inputs" jsonb,
                                  "anonymous" bool);

The above query generates a table in which each row is a valid ABI entry with the most common argument names, plus two additional fields: selector, which is useful for lookups, and indexes, which is a list of the index combinations seen for that event, sorted by popularity.

The structure of index_combinations is as follows:

with known_indexes as (
        -- returns a boolean array where each element represents if an argument is indexed
        select common.abi_indexed_event_mask(abi -> 'inputs') as indexes
        from jsonb_array_elements(jsonb_abi) as j(abi)
     ),
     perm_freq as (
        select indexes as index_perm,
               count(*) as cnt
        from known_indexes
        where indexes is not null
        group by indexes
     )
-- order inside the aggregate: row order from a CTE is not guaranteed to survive aggregation
select jsonb_agg(index_perm order by cnt desc)
from perm_freq
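For readers who would rather follow the logic outside of SQL, the same counting can be sketched in Python. This is an approximation, not the code used for the post: `indexed_mask` mirrors what the comment above says `common.abi_indexed_event_mask` returns, and `transfer_abi` is a hypothetical helper that builds sample ABI entries.

```python
from collections import Counter

def indexed_mask(inputs):
    """One boolean per event argument: is it declared `indexed`?"""
    return tuple(bool(arg.get("indexed", False)) for arg in inputs)

def index_combinations(abis):
    """Distinct index masks seen across `abis`, most frequent first."""
    counts = Counter(indexed_mask(abi["inputs"]) for abi in abis if "inputs" in abi)
    return [mask for mask, _ in counts.most_common()]

def transfer_abi(*indexed_flags):
    # Hypothetical helper: a Transfer ABI entry with the given indexed flags.
    names_and_types = [("_from", "address"), ("_to", "address"), ("_value", "uint256")]
    return {
        "type": "event",
        "name": "Transfer",
        "inputs": [
            {"name": n, "type": t, "indexed": flag}
            for (n, t), flag in zip(names_and_types, indexed_flags)
        ],
    }

sample = [
    transfer_abi(True, True, False),
    transfer_abi(True, True, True),
    transfer_abi(True, True, False),
]
print(index_combinations(sample))
# [(True, True, False), (True, True, True)]
```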

We can gather some statistics about this data to understand how often these different index combinations happen:

select count(*)                                    as cnt_events,
       sum((jsonb_array_length(indexes) > 1)::int) as cnt_multi_index_combination,
       avg(jsonb_array_length(indexes))            as avg_index_combinations,
       stddev(jsonb_array_length(indexes))         as stddev_index_combinations,
       max(jsonb_array_length(indexes))            as max_index_combinations
from aggregated_events;

cnt_events	cnt_multi	avg	stddev	max
260005	13446	1.043	0.334	8

As well as the distribution of these index combinations:

select len, count(*)
from (select jsonb_array_length(indexes) as len
      from aggregated_events) _
group by len
order by len;

len	0	1	2	3	4	5	6	7	8
count	5125	241434	11194	1746	346	109	37	12	2

Circling back to the motivating Transfer example reveals the following index combinations:

index_perm	cnt
{true,true,false}	832171
{true,true,true}	163522
{false,false,false}	534
{true,false,false}	47
{false,true,false}	5

It seems that Transfer(indexed,_,_) just beats out Transfer(_,indexed,_), 47 occurrences to 5 … maybe I should have just done it in order.
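One plausible way to put such a ranking to work, sketched here as an assumption rather than as the debugger's actual logic, is to pick the most popular combination whose number of indexed arguments matches the topic count of the log being decoded. `pick_mask` is a hypothetical name; the masks are the Transfer rows from the table above, in the same order.

```python
# Index combinations for Transfer, ordered by the counts in the table above.
TRANSFER_MASKS = [
    (True, True, False),
    (True, True, True),
    (False, False, False),
    (True, False, False),
    (False, True, False),
]

def pick_mask(ranked_masks, n_topics, anonymous=False):
    """Return the most popular mask consistent with a log's topic count.

    For non-anonymous events topic0 holds the selector, so the number of
    indexed arguments is `n_topics - 1`.
    """
    n_indexed = n_topics if anonymous else n_topics - 1
    for mask in ranked_masks:
        if sum(mask) == n_indexed:
            return mask
    return None  # no known combination fits this log

print(pick_mask(TRANSFER_MASKS, n_topics=3))  # (True, True, False)
print(pick_mask(TRANSFER_MASKS, n_topics=2))  # (True, False, False)
print(pick_mask(TRANSFER_MASKS, n_topics=4))  # (True, True, True)
```

For the ambiguous two-topic log from earlier, this picks Transfer(indexed,_,_), which is the in-order reading.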

¹ Memory parameters are the remaining (un-indexed) log fields, ABI-encoded as a tuple.
² The indexed parameters are primarily useful for efficient filtering, which is accessible via eth_getLogs.
³ This is also true for return parameters for functions. The processing we’re doing here could be applied to get a best approximation for return data decoding; however, a better approach might use type information from decompilers such as Heimdall or Gigahorse.