RFC(ckb): Better Support to Query Transaction Status

doitian · August 4, 2021, 1:08pm

Motivation

RPC get_transaction only looks for transactions in the chain and the transaction pool. It returns null for transactions which have been accepted into the pool and later removed. It’s possible to subscribe to the rejected_transaction events, but this requires a long-lived TCP connection.

For most scenarios, it is too expensive to maintain a long-lived connection to subscribe to the events. The alternative solution is polling the transaction via get_transaction. For transactions submitted successfully by send_transaction, if get_transaction returns null, the transaction has been removed. If the returned status is committed, the transaction has been successfully committed into the chain. However, this solution has some issues:

There is a bug (ckb#2907) in ckb-rs. Calling get_transaction immediately after a successful send_transaction may return null. It may take a while for the transaction to become pending.
There is no way to distinguish whether the transaction provided is unknown or recently removed.
The RPC get_transaction returns the whole transaction each time, which is an unnecessary performance overhead for the polling scenario.

This RFC proposes a solution to address these issues.

Specification

Find the root cause of ckb#2907 and fix it.
RPC get_transaction no longer returns null. Instead, the response field transaction may be null. The field tx_status.status adds two new statuses: rejected and unknown. A new field tx_status.reason tells why the transaction is rejected.
RPC get_transaction will accept a new optional request parameter to tell it not to return the transaction field.

RPC `get_transaction`

RPC get_transaction no longer returns null. Instead, the response field transaction may be null.

Field tx_status.status adds two statuses: rejected and unknown.

rejected: The transaction has been recently removed from the pool. Due to storage limitations, the node can only hold the most recently removed transactions.
unknown: The node has not seen the transaction, or the transaction was rejected but its record has since been cleared due to storage limitations.

When tx_status.status is rejected or unknown, the return field transaction must be null.

If tx_status.status is rejected, tx_status.reason tells why it is rejected. This is a new field of type string. New reasons may be added in future releases.

get_transaction also adds an optional request parameter verbosity of type Uint32, which defaults to 2.

When verbosity is 0 (deprecated): this is reserved for compatibility and will be removed in the following release. The RPC returns null as its response when the status is rejected or unknown, mimicking the original behavior.
When verbosity is 1: The RPC does not return the transaction content and the field transaction must be null.
When verbosity is 2: if tx_status.status is pending, proposed, or committed, the RPC returns the transaction content as field transaction, otherwise the field is null.
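
As an illustration, the following Python sketch builds a get_transaction request with the new verbosity parameter and reads the status fields from a response. It assumes the standard JSON-RPC 2.0 envelope and that verbosity is hex-encoded like other CKB numeric parameters; the helper names are hypothetical and not part of this RFC.

```python
import json


def build_get_transaction_request(tx_hash, verbosity=2, request_id=1):
    """Build a JSON-RPC 2.0 request body for get_transaction."""
    return {
        "id": request_id,
        "jsonrpc": "2.0",
        "method": "get_transaction",
        # Assumption: Uint32 parameters are passed as hex strings, e.g. "0x1".
        "params": [tx_hash, hex(verbosity)],
    }


def interpret_result(result):
    """Return (status, reason, has_transaction) from a get_transaction result."""
    tx_status = result["tx_status"]
    status = tx_status["status"]
    # reason is only meaningful when the status is rejected.
    reason = tx_status.get("reason") if status == "rejected" else None
    has_transaction = result.get("transaction") is not None
    return status, reason, has_transaction


# Example: a polling request that asks the node to omit the transaction body.
request = build_get_transaction_request("0x" + "00" * 32, verbosity=1)
print(json.dumps(request))
```

With verbosity 1 the caller only needs the first two values returned by interpret_result, since the transaction field is always null.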

To support the rejected state, the node must store the recently removed transactions. These include transactions submitted via RPC, transactions received via the P2P network, and transactions from blocks reverted by a chain reorganization.

The default configuration keeps the hashes of up to 10,000,000 rejected transactions from the last 7 days. The configuration can be adjusted according to the node storage size. Only the transaction hash is stored, and each hash occupies 32 bytes, so 10,000,000 entries take about 320 MB (roughly 300 MiB) of disk storage space, not counting the additional overhead.

[tx_pool]
...
keep_rejected_tx_hashes_days = 7
keep_rejected_tx_hashes_count = 10_000_000
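
The storage estimate can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic for the raw hashes; the function name is hypothetical and database overhead is ignored.

```python
HASH_SIZE_BYTES = 32  # each transaction hash occupies 32 bytes


def rejected_hash_storage_bytes(count, hash_size=HASH_SIZE_BYTES):
    """Raw size of `count` stored hashes, without any storage overhead."""
    return count * hash_size


total = rejected_hash_storage_bytes(10_000_000)
print(total / 10**6, "MB")            # 320.0 MB
print(round(total / 2**20, 1), "MiB")  # 305.2 MiB
```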

Scenarios

First of all, it is strongly recommended that the application keeps a copy of the transaction locally until the transaction is confirmed in the chain or the application decides to discard it. Do not rely on the ckb node to persist pending transactions. The current ckb implementation does not restore the transaction pool after reboot. The pull request ckb#2656 adds such a feature, but the node may still lose pending transactions, for example because of a disk failure or because a sudden power loss prevents it from saving the dump.

Single Node

For applications that use only one node to send transactions and check their status, it’s recommended to poll get_transaction(verbosity = 1) after a successful send_transaction.

If it returns unknown or rejected, the transaction is considered as rejected, because after the successful send_transaction the node must have seen and accepted the transaction before. Of course, there are situations where a node loses the transaction after a restart. In this case the application can resend the locally saved transaction via send_transaction. If send_transaction rejects the transaction, the application acts according to the error message. If send_transaction succeeds, the node must have lost the transaction and the application can continue to poll the transaction status.
If the transaction is confirmed as rejected, the application can either drop it or reassemble it with new cells depending on the use case.
If the transaction is confirmed as committed, it is better to wait for enough confirmations from newly mined blocks.
If the transaction is not confirmed as rejected or committed after a long time, the application should trigger the broadcast by calling send_transaction again, or simply call send_transaction regularly during status polling.
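
The steps above can be sketched as a small polling loop. This is one possible reading of the procedure, not a reference implementation: get_status and resend are hypothetical callables that the application would back with get_transaction(verbosity = 1) and send_transaction for the locally saved transaction, and the interval and retry counts are arbitrary.

```python
import time


def poll_transaction(get_status, resend, interval=3.0, max_polls=200, resend_every=20):
    """Poll until the transaction is committed, rejected, or polling gives up.

    get_status() returns tx_status.status as a string.
    resend() calls send_transaction with the local copy and returns True if
    the node accepted it, False if the node rejected it.
    """
    for attempt in range(1, max_polls + 1):
        status = get_status()
        if status == "committed":
            return "committed"
        if status in ("rejected", "unknown"):
            # The node may have lost the transaction, e.g. after a restart.
            # Resend the local copy; a failure confirms the rejection.
            if not resend():
                return "rejected"
        elif attempt % resend_every == 0:
            # Still pending or proposed after a while: trigger a rebroadcast.
            resend()
        time.sleep(interval)
    return "timeout"
```

After the function returns committed, the application would still wait for enough confirmations before treating the transaction as final.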

Multiple Nodes

Some applications use several CKB nodes behind a load balancer for scalability and availability. It is recommended to use the Master-Standby load balancing strategy, where the client always connects to the current master. When the master fails, one of the standby nodes is promoted to be the new master. Sticky session management can also help. Such a load balancer associates each client with its own master and uses the other nodes as fallbacks. A popular association method is hashing the client IP into a number and choosing the master using that number.
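
A minimal sketch of the IP-hashing association follows. It is illustrative only: real load balancers provide this as a built-in feature, and the function name, the use of SHA-256, and the fallback order are assumptions.

```python
import hashlib


def pick_master(client_ip, nodes, healthy=None):
    """Choose a node for a client by hashing its IP, falling back in ring order.

    nodes is a non-empty ordered list of node addresses.
    healthy is the subset of nodes currently reachable (default: all nodes).
    """
    healthy = set(nodes) if healthy is None else set(healthy)
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Try the preferred node first, then the following nodes as fallbacks.
    for offset in range(len(nodes)):
        candidate = nodes[(start + offset) % len(nodes)]
        if candidate in healthy:
            return candidate
    raise RuntimeError("no healthy node available")
```

The same client IP always maps to the same node while that node is healthy, so send_transaction and the following get_transaction calls reach the same transaction pool.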

With a simple Round Robin strategy, or when a node fails in Master-Standby, there will be a delay before the transaction appears in get_transaction after a successful send_transaction. The application must take this delay into account.

The appendix provides some suggestions to make transaction synchronization between multiple nodes more timely and reliable.

Workaround

For a CKB node which does not implement this RFC, here is a workaround.

The application uses get_transaction to poll the transaction status after a successful send_transaction. If the RPC returns null within 20s after the send_transaction, the application can treat the transaction as pending. If the RPC returns null after 20s, the application can consider it as rejected and try to call send_transaction again.
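
The workaround rule can be written as a small classification function. This is a sketch; the function name is hypothetical, and result stands for the decoded get_transaction response of a node without this RFC.

```python
def classify_legacy_result(result, seconds_since_send, grace_seconds=20):
    """Map a legacy get_transaction response to a status string.

    result is None when the RPC returned null, otherwise the response object.
    seconds_since_send is the time elapsed since the successful send_transaction.
    """
    if result is None:
        # A null response shortly after sending is treated as pending;
        # after the grace period it is treated as rejected.
        return "pending" if seconds_since_send <= grace_seconds else "rejected"
    return result["tx_status"]["status"]
```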

Related Work

The CKB Transactions Management Guideline provides some advice on how to manage pending transactions.
Currently the Block Explorer does not mark transactions as rejected when the dep cells have been used by other transactions.

Related Feedback

The following dialogues have been edited to avoid privacy leaks. There are two roles, A: and B:, where A is the person who submits the feedback and B is the technical support.

2021-07-23

A: When sending a transaction, there are two kinds of failures. The first one is due to cell preemption, the transaction is rejected; the other one is sending a transaction without error, but the transaction is not available in the explorer and the node has no related information as well. What is the general cause of this?

2021-07-29

A: The transaction has been stuck for half an hour, node version 0.42

B: The first two cell deps have been spent.

A: Those two cells will be updated continually. After such a long time, they must have been spent, so this is not the reason why the transaction has been stuck. In fact, if the cell had been spent when the transaction was first sent to the node, it would have reported an error. But it did not report an error and the transaction is still pending, so we can infer that it was not spent at that time.

A: This is the first problem. Another problem is that if the time is too long, after one of the dep cells has been spent, calling the node rpc interface returns nothing. So we hope that the node can return something, such as rejected, to help the subsequent error logic processing.

Unresolved Questions

Because of P2P and PoW, transaction status transitions can be very complex even after implementing this RFC. For example, the transaction can still become unknown or rejected suddenly and later become pending again, because the transaction is removed from the pool first and later relayed back from other peers.

There’s no RPC to tell whether a transaction has been broadcast to the P2P network. Such information would be a useful hint to determine whether to rebroadcast the transaction.

Appendix

Multi-Node Transaction Synchronization Suggestions

Whitelist

The nodes should add each other to the whitelist via the option [network].whitelist_peers in the configuration file ckb.toml.

The node connects to the whitelist nodes first and retries connecting after disconnections. The configuration option is an array in which each item looks like:

"/ip4/10.0.0.1/tcp/8115/p2p/QmWxucJPjKpfZuG7kTzYQLzRfv1h8nyMjnLBFxHDWFENjA"

The part after ip4/ is the IP of the node. If the nodes are in the same LAN, using the intranet IP takes advantage of the intranet bandwidth. The number 8115 is the p2p network listening port, which is configured via [network].listen_addresses in the same configuration file.

The last part after p2p/ is the peer-id of the node. The following command prints the node peer-id.

ckb peer-id from-secret --secret-path data/network/secret_key

Note that the ID changes after deleting the secret key file, which requires updating whitelist_peers to use the new peer ids.

Take an example of three nodes with IP 10.0.0.1, 10.0.0.2, and 10.0.0.3. After the three nodes have been initialized and have run ckb run at least once, use the command above to get their peer ids. Assume that the result is:

10.0.0.1: QmWxucJPjKpfZuG7kTzYQLzRfv1h8nyMjnLBFxHDWFENjA
10.0.0.2: QmTPYTsio5MGQkPTdVwYgM5xKcKGftx9qoBALhJi7oUKNt
10.0.0.3: QmQ7k9RYAgvWt5mWvbGG85SiXf23hjGSVjmtnsHMqzs7Hx

The node 10.0.0.2 should add the other two nodes into the whitelist like below.

whitelist_peers = [
  "/ip4/10.0.0.1/tcp/8115/p2p/QmWxucJPjKpfZuG7kTzYQLzRfv1h8nyMjnLBFxHDWFENjA",
  "/ip4/10.0.0.3/tcp/8115/p2p/QmQ7k9RYAgvWt5mWvbGG85SiXf23hjGSVjmtnsHMqzs7Hx"
]

Configure the other two nodes accordingly.
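
For larger clusters, the per-node lists can be generated rather than written by hand. The following Python sketch reuses the example IPs and peer ids above; the function name is hypothetical, and it assumes every node listens on the same port.

```python
PEER_IDS = {
    "10.0.0.1": "QmWxucJPjKpfZuG7kTzYQLzRfv1h8nyMjnLBFxHDWFENjA",
    "10.0.0.2": "QmTPYTsio5MGQkPTdVwYgM5xKcKGftx9qoBALhJi7oUKNt",
    "10.0.0.3": "QmQ7k9RYAgvWt5mWvbGG85SiXf23hjGSVjmtnsHMqzs7Hx",
}


def whitelist_for(node_ip, peer_ids, port=8115):
    """Return the whitelist_peers entries for node_ip: every node except itself."""
    return [
        f"/ip4/{ip}/tcp/{port}/p2p/{peer_id}"
        for ip, peer_id in peer_ids.items()
        if ip != node_ip
    ]


for ip in PEER_IDS:
    print(ip, whitelist_for(ip, PEER_IDS))
```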

Transaction Multicast

The most straightforward way is to send the transaction to multiple nodes at the same time by calling their RPC method send_transaction.

Some load balancers support sending the matched requests to all backend nodes and returning the response from the fastest node. The application can also implement a send_transaction gateway that sends the transactions to all the nodes.

Another option is to deploy a transaction forwarder on each node. The transaction forwarder listens for new_transaction events via the subscription RPC. When a new transaction is received, it forwards the transaction to other nodes via their RPC.
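
The gateway approach described above could look like the following sketch. It is not tied to any real client library: senders maps a node name to a hypothetical callable that wraps that node's send_transaction RPC and either returns the transaction hash or raises an exception.

```python
from concurrent.futures import ThreadPoolExecutor


def multicast_send(senders, tx):
    """Send tx to every node in parallel.

    Returns (accepted, errors): accepted maps node name to the returned
    transaction hash, errors maps node name to the error message.
    """
    def attempt(item):
        name, send = item
        try:
            return name, send(tx), None
        except Exception as exc:  # collect per-node failures instead of aborting
            return name, None, str(exc)

    with ThreadPoolExecutor(max_workers=max(1, len(senders))) as pool:
        outcomes = list(pool.map(attempt, senders.items()))

    accepted = {name: tx_hash for name, tx_hash, err in outcomes if err is None}
    errors = {name: err for name, _, err in outcomes if err is not None}
    return accepted, errors
```

The application can treat the submission as successful when at least one node accepts the transaction, and inspect errors to decide whether the remaining nodes need a retry.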

doitian · August 17, 2021, 8:33am

Updated:

New RPC get_pool_entry will not be added in this RFC.
A new field tx_status.reason is added.