Decoding Ethereum Events
Smart contracts differ from regular programs in many ways, and logging is one of them. Logs in the Ethereum ecosystem aren’t just strings written to standard out; they are a fundamental part of the Ethereum Virtual Machine (EVM), embedded in the protocol. They enable developers not only to understand what’s happening with a contract, but also to prove properties across networks thanks to receipt root proofs.
Contracts can emit events through 5 different opcodes, LOG0 through LOG4. All of these opcodes let you record some region of memory[1], and with the exception of LOG0 they all store topics, which are indexed 256-bit parameters[2]. Generally the first topic denotes the event type and is obtained by hashing the event signature with Keccak-256 (the result is known as the event selector). However, languages also provide a means of emitting “anonymous” events, which do not set the first topic to the selector.
While the notion of topics may seem unimportant (since after all, if we can emit a region of memory, we can log a string, which is how most standard applications do things), in the Ethereum ecosystem, they are given special importance through some ERCs.
The most common ERCs you might have heard of are ERC-20 and ERC-777, which define the interface (functions and events) for contracts which want to adhere to the standard. ERC-20 defines the following events:
event Transfer(address indexed _from, address indexed _to, uint256 _value)
event Approval(address indexed _owner, address indexed _spender, uint256 _value)
and also specifies where, when, and how these events are to be emitted.
Given the above definition, I know I have a Transfer event when topic0 == 0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef, and I should expect 3 topics (the selector and 2 addresses) and 1 word in memory (the value). Unfortunately, this isn’t always the case, as the algorithm that computes the selector of an event doesn’t account for which parameters are indexed[3], meaning that the following two lines produce the same selector.
to_selector('Transfer(address indexed f, address indexed t, uint256 v)')
to_selector('Transfer(address indexed f, address indexed t, uint256 indexed v)')
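To make the point concrete, here is a minimal sketch of a to_selector helper in Python. It is an illustration under assumptions, not the debugger's implementation: the signature parser only handles flat parameter lists (no tuples), and Keccak-256 is written out in pure Python because the standard library's hashlib.sha3_256 uses different padding and yields a different digest.

```python
def _rol64(a: int, n: int) -> int:
    """Rotate a 64-bit integer left by n bits."""
    n %= 64
    return ((a << n) | (a >> (64 - n))) & 0xFFFFFFFFFFFFFFFF


def _keccak_f1600(state: bytearray) -> bytearray:
    """Apply the 24-round Keccak-f[1600] permutation to a 200-byte state."""
    # Load the state into a 5x5 grid of little-endian 64-bit lanes.
    lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta: mix each column's parity into its neighbours
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate lanes and move them around the grid
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by Ethereum (domain byte 0x01, rate 136 bytes)."""
    rate = 136
    padded = bytearray(data) + bytes(rate - len(data) % rate)
    padded[len(data)] ^= 0x01  # SHA-3 would use 0x06 here
    padded[-1] ^= 0x80
    state = bytearray(200)
    for off in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[off + i]
        state = _keccak_f1600(state)
    return bytes(state[:32])


def to_selector(declaration: str) -> str:
    """Hash the canonical signature: the event name plus parameter types only."""
    name, _, rest = declaration.partition("(")
    params = rest.rsplit(")", 1)[0]
    # Keep the first token of each parameter (its type); names and the
    # 'indexed' keyword are dropped, which is why they cannot affect the hash.
    types = [p.split()[0] for p in params.split(",") if p.strip()]
    return "0x" + keccak256(f"{name.strip()}({','.join(types)})".encode()).hex()


a = to_selector('Transfer(address indexed f, address indexed t, uint256 v)')
b = to_selector('Transfer(address indexed f, address indexed t, uint256 indexed v)')
```

Both calls hash the same string, Transfer(address,address,uint256), so a and b are identical and equal to the topic0 value quoted above.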
This might not seem like a big issue, since you might only care about fully compliant tokens anyway. However, when developing our internal transaction debugger, I wanted to make it as resilient as possible to decode as many events as possible.
An initial solution to this problem might try to brute-force the index combinations. For the Transfer event this is easy, since there are only $^3C_2 = 3$ possible choices if we have 2 indexed arguments, and even a larger event with $^6C_3$ has only 20 (5 bits is definitely not post-quantum). However, this doesn’t give us the most useful data in the average case, since elements could be decoded out of order.
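As a quick check of those counts, the brute-force search space can be enumerated with the standard library. This is only a sketch; the function name index_masks is made up for illustration.

```python
from itertools import combinations
from math import comb


def index_masks(n_args: int, n_indexed: int) -> list[tuple[bool, ...]]:
    """Enumerate every way of marking n_indexed out of n_args arguments as indexed."""
    masks = []
    for chosen in combinations(range(n_args), n_indexed):
        masks.append(tuple(i in chosen for i in range(n_args)))
    return masks


transfer_masks = index_masks(3, 2)
# (True, True, False), (True, False, True), (False, True, True)
larger_masks = index_masks(6, 3)
assert len(transfer_masks) == comb(3, 2) == 3
assert len(larger_masks) == comb(6, 3) == 20
```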
Take the following log:
topic0: 0xddf2...3b3ef
topic1: 0xd8da...6045
memory: 0x0000...10f5...d2fe...2b5e3af16b1880000
We could decode it as:
Transfer(0xd8da...6045, 0x10f5...d2fe, 0x2b5e3af16b1880000)
or equally as:
Transfer(0x10f5...d2fe, 0xd8da...6045, 0x2b5e3af16b1880000)
This is because Transfer(indexed,_,_) and Transfer(_,indexed,_) are indistinguishable, as the types of arguments 1 and 2 are the same (this also extends to the more general case). You can only determine which one it is if you have the source code. One of the goals when developing our debugger was to enable decoding for source-less contracts.
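The ambiguity can be reproduced in a few lines. In this sketch the strings ADDR_A, ADDR_B and VALUE are hypothetical placeholders standing in for the truncated words of the log above, and decode_with_mask is an illustrative helper that assumes every argument occupies exactly one 32-byte word.

```python
def decode_with_mask(mask, topics, data_words):
    """Rebuild the argument list by pulling indexed args from topics
    (topic0 already removed) and the remaining args from the data words."""
    topics = list(topics)
    data_words = list(data_words)
    args = []
    for indexed in mask:
        source = topics if indexed else data_words
        args.append(source.pop(0))
    return tuple(args)


topics = ["ADDR_A"]              # topic1 (placeholder)
data_words = ["ADDR_B", "VALUE"]  # two words of memory (placeholders)

first = decode_with_mask((True, False, False), topics, data_words)
second = decode_with_mask((False, True, False), topics, data_words)
```

Both masks consume exactly one topic and two data words, so both decodings succeed: first puts ADDR_A in the sender position, second puts ADDR_B there. Nothing in the log itself says which is right.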
To solve this problem I decided to aggregate a list of known ABIs and then compute statistics over them. Gathering the ABIs themselves was simple, as Etherscan provides an API to fetch verified ABIs, and thanks to Postgres’s support for JSON objects, computing the statistics is a breeze.
create table events (
    selector bytea primary key,
    abi      jsonb
);

with abis as (
    select
        selector,
        (mode() within group (order by abi)) as _abi,
        index_combinations(jsonb_agg(abi)) as indexes
    from events
    group by selector
)
select
    selector,
    common.construct_signature(_abi) as signature,
    coalesce(abis.indexes, '[]'::jsonb) as indexes,
    abi.*
from abis,
     jsonb_to_record(_abi) as abi("name" text,
                                  "type" text,
                                  "inputs" jsonb,
                                  "anonymous" bool);
The above query generates a table in which each row is a valid ABI entry with the most common argument names, plus two additional fields: selector, which is useful for lookups, and indexes, which is a list of the observed index combinations sorted by popularity.
The structure of index_combinations is as follows:
with known_indexes as (
    -- returns a boolean array where each element represents whether an argument is indexed
    select common.abi_indexed_event_mask(abi -> 'inputs') as indexes
    from jsonb_array_elements(jsonb_abi) as j(abi)
),
perm_freq as (
    select indexes as index_perm,
           count(*) as cnt
    from known_indexes
    where indexes is not null
    group by indexes
)
-- order inside the aggregate so the most frequent combination is guaranteed to come first
select jsonb_agg(index_perm order by cnt desc)
from perm_freq
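For readers who prefer to see the same aggregation outside SQL, here is a rough Python analogue. It assumes each ABI entry is a dict in the standard JSON ABI shape with an "inputs" list whose items carry an "indexed" flag, and it is a sketch of the idea rather than a port of common.abi_indexed_event_mask.

```python
from collections import Counter


def index_combinations(abis: list[dict]) -> list[list[bool]]:
    """Return the distinct indexed-masks seen across abis, most frequent first."""
    counts = Counter(
        tuple(bool(arg.get("indexed", False)) for arg in abi["inputs"])
        for abi in abis
        if abi.get("inputs") is not None  # mirrors the 'is not null' filter
    )
    return [list(mask) for mask, _ in counts.most_common()]


# Hypothetical sample: two ERC-20 style entries and one with all three indexed.
erc20_style = {"inputs": [{"indexed": True}, {"indexed": True}, {"indexed": False}]}
all_indexed = {"inputs": [{"indexed": True}, {"indexed": True}, {"indexed": True}]}
ranked = index_combinations([erc20_style, all_indexed, erc20_style])
```

Here ranked lists the ERC-20 style mask first because it appears twice in the sample.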
We can gather some statistics about this data to understand how often these different index combinations happen:
select count(*) as cnt_total,
       sum((jsonb_array_length(indexes) > 1)::int) as cnt_multi_index_combination,
       avg(jsonb_array_length(indexes)) as avg_index_combinations,
       stddev(jsonb_array_length(indexes)) as stddev_index_combinations,
       max(jsonb_array_length(indexes)) as max_index_combinations
from aggregated_events;
cnt_total | cnt_multi | avg | stddev | max |
---|---|---|---|---|
260005 | 13446 | 1.043 | 0.334 | 8 |
As well as the distribution of these index combinations:
select len, count(*)
from (select jsonb_array_length(indexes) as len
      from aggregated_events) _
group by len
order by len;
len | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
count | 5125 | 241434 | 11194 | 1746 | 346 | 109 | 37 | 12 | 2 |
Circling back to the motivating Transfer example reveals the following index combinations:
index_perm | cnt |
---|---|
{true,true,false} | 832171 |
{true,true,true} | 163522 |
{false,false,false} | 534 |
{true,false,false} | 47 |
{false,true,false} | 5 |
It seems that Transfer(indexed,_,_) just beats out Transfer(_,indexed,_)… maybe I should have just done it in-order.
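One plausible way to consume such a ranked list when decoding is to walk it in popularity order and take the first mask that fits the shape of the log. The sketch below is an assumption about how that could look, not a description of the debugger: pick_mask is a hypothetical helper, and it assumes every un-indexed argument is a static type occupying one 32-byte word.

```python
def pick_mask(ranked_masks, n_topics, n_data_words, anonymous=False):
    """Return the most popular mask consistent with the log's shape, or None."""
    # Non-anonymous events spend topic0 on the selector.
    n_indexed = n_topics - (0 if anonymous else 1)
    for mask in ranked_masks:
        if sum(mask) == n_indexed and len(mask) - sum(mask) == n_data_words:
            return mask
    return None


# Ranking taken from the Transfer table above, most frequent first.
transfer_ranked = [
    [True, True, False],
    [True, True, True],
    [False, False, False],
    [True, False, False],
    [False, True, False],
]

standard = pick_mask(transfer_ranked, n_topics=3, n_data_words=1)
all_indexed = pick_mask(transfer_ranked, n_topics=4, n_data_words=0)
one_indexed = pick_mask(transfer_ranked, n_topics=2, n_data_words=2)
```

For the ambiguous log with two topics and two data words, this selects the Transfer(indexed,_,_) mask because it ranks above Transfer(_,indexed,_).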
1. Memory parameters are the remaining (un-indexed) log fields, ABI-encoded as a tuple.
2. The indexed parameters are primarily useful for efficient filtering, which is accessible via eth_getLogs.
3. This is also true for return parameters of functions. The processing we’re doing here could be applied to get a best approximation for return data decoding; however, a better approach might use type information from decompilers such as Heimdall or Gigahorse.