PVM Stack Underflow: Deep Dive & Fix

by Ahmed Latif

Hey guys! Today, we're diving deep into a tricky issue: a stack addressing underflow in PVM execution. This issue popped up while running v2-periphery-polkadot test cases with the latest revive-dev-node and resolc-0.3.0, causing tests to fail with those pesky "missing revert data" errors. Let's break down what happened, how we found it, and what it all means.

Understanding the Problem: Stack Addressing Underflow

Stack addressing underflow in PolkaVM (PVM), the RISC-V based virtual machine that runs contracts on pallet-revive, is the core problem we're tackling. It occurs when the PVM attempts to access a memory address outside the allocated stack space, specifically below its valid range. Think of it like trying to withdraw money from your bank account when you don't have enough funds: the transaction fails, right? Similarly, when the PVM tries to access memory outside its stack, it traps, halting execution and causing errors. In smart contract execution this is a real headache, because it surfaces as unpredictable behavior and failed transactions, so identifying the root cause is crucial for preventing future occurrences and keeping the chain stable and reliable.
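
To make the failure mode concrete, here is a toy model of a bounds-checked stack that grows downward. It is a minimal sketch, not PolkaVM's actual memory implementation: the `ToyStack` and `StackTrap` names and the 256-byte region are made up for illustration. What it shows is that, in this model, moving the stack pointer below the region's lowest address is harmless by itself; the trap fires on the next load or store through that pointer.

```typescript
// Toy model of a stack region [low, high) that grows downward.
// Hypothetical illustration only; real PolkaVM memory handling differs.
class StackTrap extends Error {
  readonly address: number;
  constructor(address: number) {
    super(`stack access out of bounds at 0x${address.toString(16)}`);
    this.name = "StackTrap";
    this.address = address;
  }
}

class ToyStack {
  readonly low: number;
  readonly high: number;
  sp: number; // stack pointer: starts at the top and moves down
  private readonly view: DataView;

  constructor(low: number, high: number) {
    this.low = low;
    this.high = high;
    this.sp = high;
    this.view = new DataView(new ArrayBuffer(high - low));
  }

  // Reserve a frame by moving sp down. In this model the move itself is
  // unchecked; an underflow only surfaces on the next memory access.
  allocFrame(size: number): void {
    this.sp -= size;
  }

  // Every access must fall entirely inside [low, high), otherwise trap.
  private check(address: number, size: number): void {
    if (address < this.low || address + size > this.high) {
      throw new StackTrap(address);
    }
  }

  store32(address: number, value: number): void {
    this.check(address, 4);
    this.view.setUint32(address - this.low, value, true); // little-endian
  }

  load32(address: number): number {
    this.check(address, 4);
    return this.view.getUint32(address - this.low, true);
  }
}

const stack = new ToyStack(0x1000, 0x1100); // 256-byte stack
stack.allocFrame(16);
stack.store32(stack.sp, 42); // inside the region: fine
stack.allocFrame(1024); // frame larger than the remaining space
try {
  stack.store32(stack.sp, 1); // sp is now below `low`
} catch (e) {
  console.log((e as Error).message); // prints "stack access out of bounds at 0xcf0"
}
```

The bounds check on every access, rather than on the pointer update, is the design choice that makes the bad frame size and the trap show up at different points in the execution, which is why a trace of the instructions leading up to the trap is so useful.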

The errors we encountered manifested as "missing revert data," a common symptom when a smart contract call fails unexpectedly: execution stops without returning any revert reason, leaving us in the dark about what went wrong. As we discovered, the underlying cause was the PVM's attempt to write data to an invalid location on the stack. That is critical because it goes straight to the integrity of the contract's execution environment: the contract is trying to access memory it shouldn't, which can lead to unpredictable behavior or potentially data corruption, with serious consequences for the contract's functionality and security. Pinpointing the exact cause required careful debugging and analysis, which we'll get into shortly.
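
From the client side, a trap, a bare `revert()` and some out-of-gas failures can all come back without revert data, so "no revert data" narrows things down but does not prove a trap. Below is a minimal sketch for sorting caught errors, assuming the ethers v6 error fields visible in the error output later in this post (`code`, `action`, `data`, `reason`). The `CallExceptionLike` shape is declared locally so the sketch runs without ethers installed, and `classifyFailure` is a hypothetical helper, not an ethers API.

```typescript
// The subset of fields ethers v6 puts on a CALL_EXCEPTION error that this
// sketch looks at. Declared locally so it runs without ethers installed.
interface CallExceptionLike {
  code?: string;
  action?: string;
  data?: string | null;
  reason?: string | null;
}

type FailureKind = "other-error" | "revert-with-data" | "no-revert-data";

function classifyFailure(err: CallExceptionLike): FailureKind {
  if (err.code !== "CALL_EXCEPTION") return "other-error";
  // require()/revert("...") and custom errors return ABI-encoded data that
  // starts with a 4-byte selector. Null or a bare "0x" means the call stopped
  // without revert data: a bare revert(), a VM trap, or something similar.
  if (typeof err.data === "string" && err.data.length > 2) return "revert-with-data";
  return "no-revert-data";
}

// The fields reported for the failing estimateGas call in this post:
console.log(
  classifyFailure({ code: "CALL_EXCEPTION", action: "estimateGas", data: null, reason: null }),
); // prints "no-revert-data"
```

In a Hardhat test you would call it from the `catch` of the failing transaction (for example `classifyFailure(e as CallExceptionLike)`) to separate ordinary reverts from the silent failures worth tracing.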

The impact of a stack addressing underflow can range from a minor inconvenience to a major security vulnerability. In our case it caused test failures, which is the good outcome: it let us identify and address the issue before it could affect real-world deployments. In a production environment, the same bug could make a smart contract behave in unexpected ways, potentially leading to financial losses or security breaches. Detecting and preventing stack addressing underflows therefore takes robust mechanisms: thorough testing, careful code reviews, and formal verification where feasible. Addressing these issues proactively is how we keep smart contracts, and the ecosystem built on them, reliable and secure.

Setting the Stage: Our Environment and Tools

To reproduce this issue, we need to understand the environment where it occurred. We were using:

  • polkadot-sdk version: c40b36c3a7c208f9a6837b80812473af3d9ba7f7 – This is the specific version of the Polkadot SDK we were working with. It's crucial to note this because bugs can be specific to certain versions.
  • Test repository: https://github.com/papermoonio/v2-periphery-polkadot/tree/bugfix/flaky – This is the repository where the failing tests reside. The bugfix/flaky branch specifically contains the code that exhibits the issue.
  • hardhat-polkadot: 0.1.5 and 0.1.8 – We used these versions of hardhat-polkadot, a tool for testing and deploying smart contracts on Polkadot.
  • resolc: 0.3.0 – This is the version of resolc, the revive compiler that compiles Solidity contracts to PolkaVM bytecode, that we were using.

Having the correct versions is super important for reproducing bugs. Different versions might have different code paths or fixes that could mask the issue.
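
Because a mismatch silently invalidates a reproduction attempt, it can help to fail fast before running anything. A minimal sketch, assuming you collect the installed versions yourself (for example with `git rev-parse HEAD` inside the polkadot-sdk checkout); `PINNED` and `versionMismatches` are hypothetical names, and the pinned values are simply the ones listed above.

```typescript
// Versions from the environment list above; hardhat-polkadot reproduced on two releases.
const PINNED: Record<string, string[]> = {
  "polkadot-sdk": ["c40b36c3a7c208f9a6837b80812473af3d9ba7f7"],
  "hardhat-polkadot": ["0.1.5", "0.1.8"],
  resolc: ["0.3.0"],
};

// Return one message per tool that is missing or not at a pinned version.
// How `installed` is gathered is left to the caller.
function versionMismatches(installed: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const [tool, accepted] of Object.entries(PINNED)) {
    const found = installed[tool];
    const expected = accepted.join(" or ");
    if (found === undefined) {
      problems.push(`${tool}: not found (expected ${expected})`);
    } else if (!accepted.includes(found)) {
      problems.push(`${tool}: ${found} (expected ${expected})`);
    }
  }
  return problems;
}

// Example with hypothetical values as they might be collected on a dev machine:
console.log(
  versionMismatches({
    "polkadot-sdk": "c40b36c3a7c208f9a6837b80812473af3d9ba7f7",
    "hardhat-polkadot": "0.1.8",
    resolc: "0.2.0",
  }),
); // reports only the resolc mismatch
```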

To debug this stack addressing underflow, we also enabled trace-level logging for the polkavm log target (polkavm=trace). It logs the PVM's execution in detail, so we could follow the exact sequence of instructions and memory accesses, pinpoint the specific instruction that caused the underflow, and understand the context in which it occurred. That level of detail is invaluable for an issue like this, because it shows exactly what the PVM is doing at each step. It's like having a magnifying glass that lets you zoom in on the PVM's execution and see every detail.
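
Trace output grows large quickly, and what usually matters is the last few instructions before the trap. Below is a minimal sketch of a bounded look-back over log lines. `tailBeforeTrap` is a hypothetical helper, and the trap predicate is an assumption: we don't reproduce PolkaVM's trace format here, so match on whatever your node actually prints when execution traps.

```typescript
// Return the `context` lines that precede the first line matching `isTrap`,
// followed by that line, or null if no line matches. At most `context`
// earlier lines are held in memory, so it also works on very large logs.
function tailBeforeTrap(
  lines: Iterable<string>,
  isTrap: (line: string) => boolean,
  context: number,
): string[] | null {
  const recent: string[] = [];
  for (const line of lines) {
    if (isTrap(line)) return [...recent, line];
    recent.push(line);
    if (recent.length > context) recent.shift(); // drop the oldest line
  }
  return null;
}

// Usage with made-up log lines; the "TRAP" marker is a placeholder.
const sample = ["step 1", "step 2", "step 3", "TRAP at step 4", "step 5"];
console.log(tailBeforeTrap(sample, (l) => l.includes("TRAP"), 2));
```

For a log file, the same loop works over an async iterator of lines (for example from Node's `readline` interface) with `for await`.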

Understanding the test environment is also critical. We used a local network instead of a Hardhat node because of some flaky behavior we encountered (more on that in #281). Adapting the testing strategy like this is often worth it: the local network gave us a more stable, predictable environment, so we could isolate the underflow without being distracted by unrelated flakiness. The choice of test environment can significantly affect how efficient your debugging is, so pick the one that best fits the problem at hand.

Reproducing the Bug: Step-by-Step

Alright, let's get our hands dirty and reproduce this stack addressing underflow! Here's what we did:

1. Build the Nodes

First, we needed to build the necessary nodes. We cloned the polkadot-sdk repository:

git clone https://github.com/paritytech/polkadot-sdk
cd polkadot-sdk
git checkout c40b36c3a7c208f9a6837b80812473af3d9ba7f7

Then, we built the eth-rpc and the revive-dev-node:

# Build eth-rpc
cargo build -p pallet-revive-eth-rpc --release

# Build revive-dev-node
cd substrate/frame/revive/dev-node/node
cargo build --release

This ensures we have the correct versions of the components needed to run the tests. Building from source guarantees that we're running exactly the code that was present when the issue occurred, which is crucial for reproducibility: a different build might contain optimizations or bug fixes that change the outcome. Building the nodes ourselves eliminates any discrepancy between our environment and the one where the issue was originally observed.

2. Prepare the Test Repository

Next up, we prepared the test repository. We cloned the v2-periphery-polkadot repository and checked out the bugfix/flaky branch:

git clone https://github.com/papermoonio/v2-periphery-polkadot
cd v2-periphery-polkadot
git checkout bugfix/flaky

This step is crucial because the bugfix/flaky branch contains the specific code that triggers the stack addressing underflow. On a different branch, the code might have been modified or fixed in the meantime and the issue might not reproduce, so pay attention to the branch and commit history when reproducing a bug.

3. Create the .env File

We created a .env file in the project root with the following content:

LOCAL_PRIV_KEY=0x5fb92d6e98884f76de468fa3f6278f8807c48bebc13595d45af5bdc4da702133
AH_PRIV_KEY=xxx

The .env file holds values the tests need, such as private keys, so they don't have to be hardcoded in scripts, which is a security risk. LOCAL_PRIV_KEY is used for local testing; the value above is the publicly known key of the Alith development account (its address, 0xf24FF3a9CF04c71Dbc94D0b566f7A27B94566cac, appears as "from" in the error output below), so it must never hold real funds. AH_PRIV_KEY is likely used for testing in a different environment; a real key like that must be protected and kept out of version control. Using a .env file is a common practice for managing sensitive configuration data.
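
A missing or malformed key tends to surface later as a confusing signer error, so a quick check before running the tests can save time. A minimal sketch: `parseDotEnv` and `isPrivateKey` are hypothetical helpers, and the parser only handles the simple subset used above (no quotes, no multi-line values, no variable expansion); a real project would typically rely on the dotenv package instead.

```typescript
// Parse simple KEY=VALUE lines, skipping blanks and # comments. This covers
// the subset of the .env format used above (no quoting, no multi-line values).
function parseDotEnv(text: string): Map<string, string> {
  const vars = new Map<string, string>();
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "="
    vars.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  return vars;
}

// An Ethereum-style private key: "0x" followed by 64 hex digits (32 bytes).
function isPrivateKey(value: string | undefined): boolean {
  return value !== undefined && /^0x[0-9a-fA-F]{64}$/.test(value);
}

const env = parseDotEnv(
  "LOCAL_PRIV_KEY=0x5fb92d6e98884f76de468fa3f6278f8807c48bebc13595d45af5bdc4da702133\n" +
    "AH_PRIV_KEY=xxx\n",
);
console.log(isPrivateKey(env.get("LOCAL_PRIV_KEY"))); // true
console.log(isPrivateKey(env.get("AH_PRIV_KEY"))); // false: "xxx" is a placeholder
```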

4. Run the Tests

Finally, we ran the tests using the following command:

pnpm install
USE_POLKAVM=true npx hardhat test ./test/ExampleComputeLiquidityValue.spec.ts --network local

Remember, we're using a local network because of the flaky behavior mentioned earlier. The USE_POLKAVM=true environment variable is critical here: it tells the Hardhat setup to use PolkaVM, which is how the tests exercise PVM execution and, consequently, hit the stack addressing underflow. Without it, the tests might run without encountering the issue. npx hardhat test runs the specified test file (./test/ExampleComputeLiquidityValue.spec.ts) with Hardhat, and --network local selects the local network from the project's Hardhat configuration, which talks to the revive-dev-node and eth-rpc built in step 1 (both need to be running). Running this command reproduces the stack addressing underflow and the resulting error message.

The Error Output: Decoding the Message

Running the tests produced the following error, which is our starting point for understanding the problem. The full error log was linked in the original report, but let's focus on the most relevant part:

Error: missing revert data (action="estimateGas", data=null, reason=null, transaction={ "data": "0x38ed17390000000000000000000000000000000000000000000000008ac7230489e80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000f24ff3a9cf04c71dbc94d0b566f7a27b94566cacffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff00000000000000000000000000000000000000000000000000000000000000020000000000000000000000007b21801c4b7219bdeb3494ac98e948abbd25b2e9000000000000000000000000e07fd4cc631b88ad64d3782a7ecdc1d4c8382b70", "from": "0xf24FF3a9CF04c71Dbc94D0b566f7A27B94566cac", "to": "0x6CAa59f27B0b3b5Adc07a2b3EcB7142B3C74f424" }, invocation=null, revert=null, code=CALL_EXCEPTION, version=6.14.3)
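
The transaction's `data` field tells us which call was being estimated when execution failed. Its first four bytes, 0x38ed1739, are the standard selector of `swapExactTokensForTokens(uint256,uint256,address[],address,uint256)` from the Uniswap V2 router interface. Below is a minimal hand-rolled decoder for just that signature, following the Solidity ABI head/tail layout; `decodeSwapExactTokensForTokens` is a hypothetical helper, and a real project would use an ABI library such as ethers' `Interface` instead.

```typescript
// Arguments of swapExactTokensForTokens(uint256,uint256,address[],address,uint256).
interface SwapArgs {
  amountIn: bigint;
  amountOutMin: bigint;
  path: string[];
  to: string;
  deadline: bigint;
}

const SELECTOR = "38ed1739";
const WORD = 64; // one 32-byte ABI word, in hex characters

function decodeSwapExactTokensForTokens(calldata: string): SwapArgs {
  const hex = calldata.toLowerCase().replace(/^0x/, "");
  if (!hex.startsWith(SELECTOR)) throw new Error("unexpected function selector");
  const body = hex.slice(SELECTOR.length);

  const word = (i: number): string => {
    const w = body.slice(i * WORD, (i + 1) * WORD);
    if (w.length !== WORD) throw new Error(`calldata too short at word ${i}`);
    return w;
  };
  const uint = (i: number): bigint => BigInt("0x" + word(i));
  const address = (i: number): string => "0x" + word(i).slice(24); // low 20 bytes

  // `path` is dynamic: its head slot holds a byte offset (relative to the
  // start of the arguments) to a length word followed by the elements.
  const offset = Number(uint(2));
  if (offset % 32 !== 0) throw new Error("misaligned offset for path");
  const pathStart = offset / 32;
  const pathLen = Number(uint(pathStart));
  const path: string[] = [];
  for (let k = 0; k < pathLen; k++) path.push(address(pathStart + 1 + k));

  return { amountIn: uint(0), amountOutMin: uint(1), path, to: address(3), deadline: uint(4) };
}

// Calldata copied from the failing estimateGas call above.
const failing =
  "0x38ed17390000000000000000000000000000000000000000000000008ac7230489e80000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000f24ff3a9cf04c71dbc94d0b566f7a27b94566cacffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff00000000000000000000000000000000000000000000000000000000000000020000000000000000000000007b21801c4b7219bdeb3494ac98e948abbd25b2e9000000000000000000000000e07fd4cc631b88ad64d3782a7ecdc1d4c8382b70";
const decoded = decodeSwapExactTokensForTokens(failing);
console.log(decoded);
```

Decoded this way, the failing call swaps amountIn = 10^19 base units with amountOutMin = 0 along a two-token path, sends the output back to the sender's own address, and uses the maximum uint256 as its deadline. That tells us which router call to focus on when reading the trace.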

The key part here is "missing revert data". This indicates that the transaction failed, but we didn't get a clear reason why. This is a common symptom of underlying issues like stack addressing underflows. The code=CALL_EXCEPTION further suggests that the contract execution encountered an exception during a call, which could be due to various reasons, including the stack underflow we're investigating. The `action=