Unbrick Nodle
Please check our post here describing the incident and its causes. This document focuses on a detailed description of the steps taken to resolve the issue.
TL;DR
The Nodle Parachain has been bricked since its last upgrade failed. This proposal uses the forceSetCurrentCode functionality on Polkadot to ensure the prompt restart of the parachain.
Solution
The solution presented below intends to maintain the existing parachain state and does not require reverting previously approved transactions, which would break finality guarantees.
It appears everything in the recent upgrade was sound, with all state transitions being correct and intended. However, the migration code associated with the upgrade takes too long and causes the proof of validity (PoV) to grow beyond what the relay chain validators tolerate. Because the migration happens right when the runtime is upgraded, the parachain halts block production: the next block is always rejected by the Polkadot validators. As such, the head of the parachain did not progress to any undesired state, and thus can be reused.
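The halting behaviour described above can be illustrated with a toy model. This is only a sketch: the class, its fields, and the byte sizes are hypothetical, and the limit is the 5MB figure quoted later in this document, read here as 5 MiB. It does not reproduce any real Polkadot or Cumulus logic.

```python
from dataclasses import dataclass

# Assumption: the "5MB" PoV limit mentioned in this document, read as 5 MiB.
MAX_POV_BYTES = 5 * 1024 * 1024


@dataclass
class ToyParachain:
    """Hypothetical model of a parachain whose next block carries a migration."""
    head: int = 0                 # number of the last block accepted by the relay chain
    migration_pov_bytes: int = 0  # PoV added by a pending migration (0 = none)

    def try_produce_block(self, base_pov_bytes: int = 100_000) -> bool:
        """Build the next candidate; it is rejected if its PoV is too large."""
        pov = base_pov_bytes + self.migration_pov_bytes
        if pov > MAX_POV_BYTES:
            return False  # rejected: head and state stay exactly as they were
        self.head += 1
        self.migration_pov_bytes = 0  # a migration runs once, in the first accepted block
        return True


# Every attempt re-runs the same heavy migration, so the head never moves.
chain = ToyParachain(head=1000, migration_pov_bytes=20 * 1024 * 1024)
attempts = [chain.try_produce_block() for _ in range(5)]
stalled_head = chain.head

# Swapping in a runtime without the migration lets the next block through.
chain.migration_pov_bytes = 0
recovered = chain.try_produce_block()
```

In this model the rejected attempts leave `head` untouched, which mirrors the observation that the parachain head did not reach any undesired state and can be reused.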
This means that it should be possible to unbrick the parachain by force setting the current code to a new runtime that is very similar to the stored runtime. As such, a new runtime has been prepared with the changes below:
- Same spec version as the initially deployed runtime
- Keeps the uniques pallet at the same storage prefix
- Removes the migration
- Removes the new pallet associated with the migration
Once this code is forced on the relay chain, collators can use the --wasm-runtime-overrides flag to force their nodes to use the wasm code that is recognized by the relay chain.
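A minimal sketch of why the identical spec version matters for the override, assuming (as we understand the flag) that a local wasm file only replaces the on-chain runtime when its version matches. The helper names, spec version numbers, binary name, and paths below are hypothetical; only the --wasm-runtime-overrides flag comes from this document.

```python
from typing import Dict, List, Optional


def pick_override(on_chain_spec_version: int,
                  overrides: Dict[int, str]) -> Optional[str]:
    """Return the local wasm path whose spec version matches the chain's, else None."""
    return overrides.get(on_chain_spec_version)


def collator_args(binary: str, overrides_dir: str) -> List[str]:
    """Build the argument list for restarting a collator with local overrides."""
    return [binary, "--wasm-runtime-overrides", overrides_dir]


# Hypothetical values, for illustration only.
local_overrides = {100: "/data/overrides/good_runtime.wasm"}
matched = pick_override(100, local_overrides)     # same spec version: override applies
unmatched = pick_override(101, local_overrides)   # different spec version: ignored
args = collator_args("./collator-binary", "/data/overrides")
```

Under this assumption, a fixed runtime with a bumped spec version would not be picked up by the collators, which is one reason to keep the version unchanged.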
Testing and verification
Here are the steps we took to test this proposal:
- We used polkadot-launch to run a local rococo relay chain with a few validators and a collator of our eden-dev parachain, all starting from fresh state. For the relay chain binary we used Polkadot-0.9.42, and for our parachain we used nodle-2.0.30, which is the previously published image for our nodes and contains the exact same runtime as was on our chain right before the problematic upgrade.
- We built a runtime wasm from code that could easily induce a heavy migration step in a local test. We chose to add an infinite loop as the fastest way to reproduce the conditions of the issue.
- We upgraded the local collator's runtime with the runtime wasm that had the intentional failure in its migration. The node indeed stopped producing blocks, with the exact same message as we had seen on our collator: “Discarding proposal …took too long”. This confirms correct reproduction of the current issue.
- We then used Sudo on the local rococo relay chain and called forceSetCurrentCode with the runtime wasm that we are going to use in our proposal. Let’s refer to that as the good runtime. We have already discussed the idea behind this code change in the Solution section above.
- We then stopped the collator and re-ran it using --wasm-runtime-overrides and setting it to the folder that contained the good runtime. The storage of the collator was maintained and no modification was made there. After the restart the collator started producing blocks and the relay chain backed and included them.
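The reproduction step in the list above can be sketched in a few lines. This is a simplified stand-in, not the collator's actual proposer logic: the function name, the step budget, and the returned status strings are hypothetical, and the budget is counted in steps rather than wall-clock time so that the example is deterministic.

```python
import itertools
from typing import Iterable


def build_proposal(migration_steps: Iterable[int], step_budget: int) -> str:
    """Run migration steps until they finish or the budget is exhausted."""
    used = 0
    for _ in migration_steps:
        used += 1
        if used > step_budget:
            return "discarded"  # stands in for the "took too long" case
    return "proposed"


# An endless iterator plays the role of the intentionally broken migration.
bad_result = build_proposal(itertools.count(), step_budget=1_000)

# The good runtime carries no migration at all, so there is nothing to run.
good_result = build_proposal(iter(()), step_budget=1_000)
```

The same budget that discards the looping migration accepts the runtime without one, which matches the before and after behaviour observed in the local test.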
Proposal Preimage
Our proposal is paras -> forceSetCurrentCode(para: 2026, newCode: [our good runtime]), leading to the preimage hash of “0xbdeb173184c3a932473b7921c0feb233900bbe9228c54dc346239a93b186e9cc” and a preimage length of 1216833, as shown in the following screenshot:
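Anyone wishing to check a preimage independently could use something like the sketch below. It assumes the preimage hash is the BLAKE2b-256 digest of the SCALE-encoded call, and that the call is laid out as one pallet-index byte, one call-index byte, a 4-byte para id, and a compact-length-prefixed code blob. The function names are hypothetical, and the encoded call bytes must be obtained elsewhere.

```python
import hashlib


def compact_prefix_size(n: int) -> int:
    """Size in bytes of a SCALE compact-encoded length n."""
    if n < 1 << 6:
        return 1
    if n < 1 << 14:
        return 2
    if n < 1 << 30:
        return 4
    raise ValueError("big-integer mode is not handled in this sketch")


def expected_call_len(code_len: int) -> int:
    """Assumed layout: pallet index + call index + u32 para id + prefix + code."""
    return 1 + 1 + 4 + compact_prefix_size(code_len) + code_len


def check_preimage(encoded_call: bytes, expected_hash: str, expected_len: int) -> bool:
    """Compare the BLAKE2b-256 digest and the length against the published values."""
    digest = "0x" + hashlib.blake2b(encoded_call, digest_size=32).hexdigest()
    return digest == expected_hash.lower() and len(encoded_call) == expected_len
```

Under the assumed layout, a preimage length of 1216833 would correspond to a wasm blob of 1216823 bytes, since `expected_call_len(1216823)` adds 10 bytes of framing. This is an inference from the encoding assumptions, not a figure stated in the proposal.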
Prevention Strategy
We intend to work with the Polkadot community to improve existing testing tools so that they detect issues similar to the one being fixed today. Namely, we would like to extend try-runtime to at least show whether the PoV for a runtime upgrade grows beyond the 5MB limit that is set for validators. Alternatively, this could take the form of an independent test tool.
Regardless of what tool is developed, we will enforce the benchmarking of any migration code in our upgrade process.
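As a rough idea of what such a check might report, the sketch below compares measured PoV sizes against the limit. It is not an existing try-runtime feature: the function name, the migration names, and the measured sizes are hypothetical, and the 5MB limit is read as 5 MiB.

```python
from typing import Dict, List, Tuple

# Assumption: the 5MB validator limit mentioned above, read as 5 MiB.
POV_LIMIT_BYTES = 5 * 1024 * 1024


def pov_usage(measured: Dict[str, int],
              limit: int = POV_LIMIT_BYTES) -> List[Tuple[str, int, bool]]:
    """For each migration, return (name, percent of limit used, within limit?)."""
    return [
        (name, size * 100 // limit, size <= limit)
        for name, size in sorted(measured.items())
    ]


# Hypothetical measurements from a dry run of two migrations.
report = pov_usage({
    "heavy_migration": 12 * 1024 * 1024,
    "small_migration": 512 * 1024,
})
failing = [name for name, _, ok in report if not ok]
```

A check like this could run in CI before an upgrade is proposed, failing the build whenever `failing` is not empty.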
Additionally, we would like to investigate whether some patches could be contributed to the Polkadot codebase to optimize for the liveness of parachains by falling back to the last known good runtime in case of upgrade or migration failures. Because this sounds like a much bigger effort given our team's expertise, we will need further discussions on this topic.
Proposal Passed