CKBFS Protocol: A standard for a witness-based content storage system

Short Abstract

CKBFS is a protocol that describes a witness-based file storage system. With CKBFS, one can:

  • Store files on Nervos CKB Network
  • Publish large files that may exceed the block size limit (~500 KB) across multiple blocks, while keeping indexing and retrieval simple and straightforward

It contains:

Background

Storing files on the blockchain can become troublesome and hard to manage. Solutions like Ordinals and DOBs on RGB++ use similar methods to inscribe images, text, and other content in BTC witnesses. While this works and has become the foundation of the current DOB ecosystem, these solutions share some drawbacks:

  • No oracle guarantees
  • High costs
  • Even if you develop a native dapp on CKB, you still need to build various infrastructures across two chains
  • …

These led to the idea of designing this protocol.

Leveraging CKB's powerful type system and programming model, this protocol was designed to be simple, flexible, usable, and reproducible.

Core Concepts

The core concepts of CKBFS are:

  • One single cell (called a CKBFS cell) indexes one file, even a multi-part file that is split across blocks.
  • Permanent storage. Deletion is forbidden by the contract, which means the file metadata will never be lost. It also means you need to lock some CKB capacity FOREVER in order to create it.
  • Simple cell structure, encoded with molecule.
  • Built-in checksum, verifiable both on-chain and off-chain.

The Protocol - or, the standard

Data Structure

CKBFS Cell

CKBFS Cell is a cell that stores:

A CKBFS cell should look like the following:

Data:
  content_type: Bytes # String Bytes
  filename: Bytes # String Bytes
  index: Uint32Opt # referenced witnesses index.
  checksum: Uint32 # Adler32 checksum
  backlinks: Vec<BackLink>

Type:
  hash_type: "data1"
  code_hash: CKBFS_TYPE_DATA_HASH
  args: <TypeID, 32 bytes>,[<hasher_code_hash>, optional]
Lock:
  <user_defined>

The following rules should be met in a CKBFS cell:

  • Rule 1: the data structure of a CKBFS cell is molecule encoded. See Molecule Definitions below.
  • Rule 2: checksum MUST match the specified witnesses. The default checksum algorithm is Adler32 if hasher_code_hash is not specified in the Type script args.
  • Rule 3: if the backlinks (see definition below) of a CKBFS cell are not empty, the file was stored across different transactions.
  • Rule 4: if hasher_code_hash is specified, the hasher binary from CellDeps that matches code_hash is used, with the same input parameters.
  • Rule 5: once created, a CKBFS cell can only be updated or transferred, which means it cannot be destroyed.

BackLink

BackLink stands for the prefix part of a living CKBFS cell. The strategy of CKBFS is similar to a linked list:

[backlink]←[backlink]←[…]←[CKBFS cell]

BackLink:
  tx_hash: Bytes,
  index: Uint32,
  checksum: Uint32,

Witnesses

File contents are stored in witnesses.

Witnesses:
  <"CKBFS"><0x00><CONTENT_BYTES>

The following rules should be met for witnesses used in CKBFS:

  • Rule 6: The first 5 bytes MUST be the UTF-8 encoded bytes of the string CKBFS, which is: 0x434b424653.
  • Rule 7: The 6th byte of the witness MUST be the version of the CKBFS protocol, which is: 0x00.
  • Rule 8: File content bytes are stored from the 7th byte onward. The checksum hasher should also take bytes from [7…].
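
As a rough illustration of Rules 6 to 8, the witness layout can be sketched off-chain in a few lines of Python. The helper names `encode_witness` and `decode_witness` are hypothetical and not part of the protocol or of any CKB SDK; the sketch only assumes the 5-byte magic, the 1-byte version, and content starting at the 7th byte as stated above.

```python
CKBFS_MAGIC = b"CKBFS"             # UTF-8 bytes 0x434b424653 (Rule 6)
CKBFS_VERSION = 0x00               # protocol version byte (Rule 7)
HEADER_LEN = len(CKBFS_MAGIC) + 1  # content starts at the 7th byte (Rule 8)


def encode_witness(content: bytes) -> bytes:
    """Prefix raw file bytes with the CKBFS magic and version byte."""
    return CKBFS_MAGIC + bytes([CKBFS_VERSION]) + content


def decode_witness(witness: bytes) -> bytes:
    """Check the 6-byte header and return the content bytes that follow it."""
    if len(witness) < HEADER_LEN:
        raise ValueError("witness is shorter than the CKBFS header")
    if witness[:5] != CKBFS_MAGIC:
        raise ValueError("missing CKBFS magic bytes")
    if witness[5] != CKBFS_VERSION:
        raise ValueError("unsupported CKBFS version")
    return witness[HEADER_LEN:]
```

Only the bytes returned by `decode_witness` would be fed to the checksum hasher; the header itself is not part of the file content.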

Operations

This section describes the operations and restrictions in a CKBFS implementation.

Publish

Publish operation creates one or more new CKBFS cells.

// Publish
Witnesses:
  <...>
  <0x434b424653, 0x0, CKBFS_CONTENT_BYTES>
  <...>
Inputs:
  <...>
Outputs:
  <vec> CKBFS_CELL
    Data:
      content-type: string
      filename: string
      index: uint32
      checksum: uint32
      backlinks: empty_vec
    Type:
      code_hash: ckbfs type script
      args: 32 bytes type_id, (...)
  <...>

Publish operation must satisfy the following rule:

  • Rule 9: in a publish operation, checksum MUST equal hash(Witnesses[index]).

Append

Append operation updates an existing live CKBFS cell and validates the latest checksum.

// Append
Witnesses:
  <...>
  <CKBFS_CONTENT_BYTES>
  <...>
Inputs:
  <...>
  CKBFS_CELL
    Data:
      content-type: string
      filename: string
      index: uint32
      checksum: uint32
      backlinks: empty_vec
    Type:
      code_hash: ckbfs type script
      args: 32 bytes type_id, (...)
  <...>
Outputs:
  <...>
  CKBFS_CELL:
    Data:
      content-type: string
      filename: string
      index: uint32
      checksum: uint32 # updated checksum
      backlinks: vec<BackLink>
    Type:
      code_hash: ckbfs type script
      args: 32 bytes type_id, (...)
  • Rule 10: the backlinks field of a CKBFS cell is append-only. Once allocated, records in the vector CANNOT be modified.
  • Rule 11: the new checksum of the updated CKBFS cell MUST equal: hasher.recover_from(old_checksum).update(new_content_bytes)
  • Rule 12: content-type, filename, and Type args of a CKBFS cell CANNOT be updated under ANY condition
  • Rule 13: in an append operation, the output CKBFS cell's index cannot be null
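
For the default Adler32 hasher, Rules 10 and 11 can be sketched off-chain with Python's standard `zlib` module, whose `adler32` accepts a starting value and so plays the role of `hasher.recover_from(old_checksum).update(...)`. The function names below are hypothetical. The figures in the usage note are the ones quoted in the worked example later in this thread, and they assume the appended bytes begin with a newline.

```python
import zlib


def append_checksum(old_checksum: int, new_content: bytes) -> int:
    """Rule 11: continue the Adler-32 state from old_checksum over new_content."""
    return zlib.adler32(new_content, old_checksum)


def backlinks_append_only(old: list, new: list) -> bool:
    """Rule 10: every existing backlink must be kept, unchanged and in order."""
    return len(new) >= len(old) and new[: len(old)] == old
```

With these helpers, `append_checksum(1014105452, b"\nHELLO BTC")` gives 2136344547, matching the second checksum in that example.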

Transfer

Transfer operation transfers ownership of a CKBFS cell and ensures it does not lose track of its backlinks.

// Transfer
Witnesses:
  <...>
Inputs:
  <...>
  CKBFS_CELL
    Data:
      content-type: string
      filename: string
      index: uint32
      checksum: uint32
      backlinks: empty_vec
    Type:
      code_hash: ckbfs type script
      args: 32 bytes type_id, (...)
    Lock:
      <USER_DEFINED>
  <...>
Outputs:
  <...>
  CKBFS_CELL:
    Data:
      content-type: string
      filename: string
      index: null
      checksum: uint32 # unchanged, see Rule 16
      backlinks: vec<BackLink>
    Type:
      code_hash: ckbfs type script
      args: 32 bytes type_id, (...)
    Lock:
      <USER_DEFINED>
  • Rule 14: in a transfer operation, the output CKBFS cell's index MUST be null
  • Rule 15: if the input CKBFS cell's backlinks is empty, then a backlink should be appended to the output's backlinks following Rule 10. Otherwise, the backlinks should not be updated
  • Rule 16: in a transfer operation, checksum CANNOT be updated
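
A minimal off-chain sketch of how a transfer could be checked against Rules 10, 12, 14 and 16 is shown below. The `CkbfsData` dataclass and `check_transfer` function are hypothetical illustrations, not the on-chain type script. The conditional part of Rule 15 is deliberately not modeled; the sketch only checks that existing backlinks are preserved.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BackLink = Tuple[bytes, int, int]  # (tx_hash, index, checksum)


@dataclass
class CkbfsData:
    content_type: bytes
    filename: bytes
    index: Optional[int]
    checksum: int
    backlinks: List[BackLink] = field(default_factory=list)


def check_transfer(inp: CkbfsData, out: CkbfsData) -> List[str]:
    """Return the violated rules; an empty list means the transfer looks valid."""
    errors = []
    if out.index is not None:
        errors.append("Rule 14: output index must be null")
    if out.checksum != inp.checksum:
        errors.append("Rule 16: checksum must not change")
    if (out.content_type, out.filename) != (inp.content_type, inp.filename):
        errors.append("Rule 12: content_type and filename are immutable")
    if out.backlinks[: len(inp.backlinks)] != inp.backlinks:
        errors.append("Rule 10: existing backlinks must be preserved")
    return errors
```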

Other Notes

Molecule Definitions:

Here are the molecule definitions of the CKBFS data structures:

array Uint32 [byte; 4];
vector Bytes <byte>;
option BytesOpt (Bytes);
option Uint32Opt (Uint32);

table BackLink {
  tx_hash: Bytes,
  index: Uint32,
  checksum: Uint32,
}

vector BackLinks <BackLink>;

table CKBFSData {
  index: Uint32Opt,
  checksum: Uint32,
  content_type: Bytes,
  filename: Bytes,
  backlinks: BackLinks,
}
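
To make the layout concrete, here is a rough Python sketch of how `CKBFSData` could be serialized by hand. It assumes the standard molecule encoding (little-endian 4-byte sizes and offsets, tables and dynamic vectors sharing the same header layout, an empty option encoded as zero bytes) and assumes `Uint32` is a 4-byte little-endian array as in CKB's common schema. All function names are hypothetical; in practice a generated molecule binding would normally be used instead. Note that the field order follows the table definition above, not the order of the earlier Data listing.

```python
import struct
from typing import List, Optional, Tuple

BackLink = Tuple[bytes, int, int]  # (tx_hash, index, checksum)


def u32(n: int) -> bytes:
    """Uint32, assumed to be a 4-byte little-endian array."""
    return struct.pack("<I", n)


def fixvec_bytes(b: bytes) -> bytes:
    """vector Bytes <byte>: item count followed by the raw bytes."""
    return u32(len(b)) + b


def dyn(items: List[bytes]) -> bytes:
    """Header shared by tables and dynamic vectors: total size, offsets, items."""
    pos = 4 + 4 * len(items)  # size of the header itself
    offsets = []
    for item in items:
        offsets.append(pos)
        pos += len(item)
    return u32(pos) + b"".join(u32(o) for o in offsets) + b"".join(items)


def encode_backlink(tx_hash: bytes, index: int, checksum: int) -> bytes:
    return dyn([fixvec_bytes(tx_hash), u32(index), u32(checksum)])


def encode_ckbfs_data(index: Optional[int], checksum: int, content_type: bytes,
                      filename: bytes, backlinks: List[BackLink]) -> bytes:
    index_field = b"" if index is None else u32(index)  # Uint32Opt
    return dyn([
        index_field,
        u32(checksum),
        fixvec_bytes(content_type),
        fixvec_bytes(filename),
        dyn([encode_backlink(*b) for b in backlinks]),
    ])
```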

Checksum Validator Procedure:

Below is pseudocode showing how one can validate the checksum:

function validate_checksum(witness, expected_checksum, backlinks): boolean;
var
  hasher: thasher;
  computed_checksum: uint32;
  content_bytes: bytes;
  last_backlink: backlink;
begin
  // If backlinks is not empty, recover hasher state from the last backlink's checksum
  if length(backlinks) > 0 then
  begin
    last_backlink := backlinks[length(backlinks) - 1];
    hasher.recover(last_backlink.checksum);
  end;

  // Extract the content bytes from the witness starting from the 7th byte
  content_bytes := copy(witness, 7, length(witness) - 6);
  
  // Update the hasher with the content bytes
  hasher.update(content_bytes);

  // Finalize and compute the checksum
  computed_checksum := hasher.finalize;
  
  // Compare the computed checksum with the expected checksum
  if computed_checksum = expected_checksum then
    validate_checksum := true
  else
    validate_checksum := false;
end;
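
For readers who prefer something runnable, the same procedure can be sketched in Python for the default Adler32 case. This is an off-chain approximation under stated assumptions, not the deployed contract: it uses `zlib.adler32` with a starting value in place of `hasher.recover`, represents a backlink as a hypothetical `(tx_hash, index, checksum)` tuple, and does not validate the witness header.

```python
import zlib
from typing import Sequence, Tuple

BackLink = Tuple[bytes, int, int]  # (tx_hash, index, checksum)


def validate_checksum(witness: bytes, expected_checksum: int,
                      backlinks: Sequence[BackLink]) -> bool:
    # A fresh Adler-32 state is 1; otherwise resume from the last backlink.
    state = backlinks[-1][2] if backlinks else 1
    # Skip the 5-byte "CKBFS" magic and the 1-byte version.
    content = witness[6:]
    return zlib.adler32(content, state) == expected_checksum
```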

Advanced Usage - Branch Forking File Appendix

Assuming that we have created a CKBFS Cell:

CKBFS_CELL:
  Data:
    content-type: string
    filename: string
    index: 0x0
    checksum: 0xFE02EA11
    backlinks: [BACKLINK_1, BACKLINK_2, ...]
  Type:
    code_hash: CKBFS_CODE_HASH
    args: TYPE_ID_A
  Lock:
    <USER_DEFINED>

It is possible to create a fork of this CKBFS cell with a special publish, similar to append but with the referenced CKBFS cell placed in CellDeps:

CellDeps:
  <...>
  CKBFS_CELL:
    Data:
      content-type: string
      filename: string
      index: 0x0
      checksum: 0xFE02EA11
      backlinks: [BACKLINK_1, BACKLINK_2, ...]
    Type:
      code_hash: CKBFS_CODE_HASH
      args: TYPE_ID_A
    Lock:
      <USER_DEFINED>
  <...>
Witnesses:
  <...>
  <CKBFS_CONTENT_BYTES>
  <...>
Inputs:
  <...>
Outputs:
  <...>
  CKBFS_CELL
    Data:
      content-type: string
      filename: string
      index: uint32
      checksum: UPDATED_CHECKSUM
      backlinks: [BACKLINK_1, BACKLINK_2, ...]
    Type:
      code_hash: ckbfs type script
      args: TYPE_ID_B
  <...>

This lets us create variant versions from the same reference data, allowing us to achieve something like git branching, header-shared data, etc.


Here are example transactions that use the deployed testnet contract. The final whole content is:

HELLO CKBFS
HELLO CKB
By Code Monad, 2024

Hey @codemonad, congratulations on your hard work, finally a design for an IPFS on Nervos L1!! :partying_face:

I was wondering, could you detail a bit more which data is immutable and which data is mutable in this design?


While it's all described in "Rules", I can give you a simple summary of the mutability in case anything is unclear.

The structure
Here's the data structure of a CKBFS Cell:

Data:
  content_type: Bytes # String Bytes
  filename: Bytes # String Bytes
  index: Uint32Opt # referenced witnesses index.
  checksum: Uint32 # Adler32 checksum
  backlinks: Vec<BackLink>

and BackLink's structure:

BackLink:
  tx_hash: Bytes,
  index: Uint32,
  checksum: Uint32,

What is immutable

  • CKBFS's content_type
  • CKBFS's filename
  • all allocated BackLinks. No field in a BackLink can be modified once it has been appended to the CKBFS cell's data.

What is mutable

  • CKBFS's index. It should reference the witness index every time you append (write) a new content part to the file.
  • CKBFS's checksum. It should be updated every time the content body changes.
  • CKBFS's backlinks can be appended to (new items pushed), but previous items cannot be deleted or modified.

Sorry, still not seeing the full picture :thinking:

I'll propose a few simple ideas (that you likely already considered) and you tell me why you chose the proposed design. @xxuejie did the same with the iCKB design, so it's fair game.

Diffs

Reading your proposal once again, I noticed that the underlying reason why you chose the linked list (instead of structures more commonly used in file systems, like trees of inodes) is that you can recover the previous checksum and use it to checksum the full file. This is smart because:

Now we have a different issue, diffs.

Let's assume we have a JSON file split across 100 txs and I need to update the part in the first tx. In your design I then have to re-deploy all 100 txs, is that correct? Could you detail more the Branch Forking File section?

Maybe it would be easier to support only immutable data, but a CKBFS that fully supports diffs would be primed for broader adoption.

By relaxing the assumption of a checksum over the full file, and with it the need for a linked list, we could switch back to the usual trees of inodes seen in filesystems and Merkle trees.

Sure, we may lose the ability to reference a file by its checksum (as the same file may have many different checksums, depending on the underlying structure), but is this really relevant?

We can just anchor all references to the root inode OutPoint.

Pretty sure there are better and more formalized designs based on an immutable FS and Merkle trees, this is just to give an idea.

So, why does the presented design use a Linked List instead of an inode Tree?

Homogeneity

The proposed design seems homogeneous, maybe too homogeneous.

Data:
  content_type: Bytes # String Bytes
  filename: Bytes # String Bytes
  index: Uint32Opt # referenced witnesses index.
  checksum: Uint32 # Adler32 checksum
  backlinks: Vec<BackLink>

Why does every element have to contain content_type and filename?

Every element in the list except the root does not seem to need this particular information.

Why put the full backlinks: Vec<BackLink>?

Strictly speaking, only the last one is used in the checksum verification; the rest is only used by the off-chain system to get all the relevant information faster. (An inode design, on the other hand, would make full use of this info though.)

Generally, why put all this information in cell data?

Once we put the checksum in cell data and the checksum is checked, the safety of the information in the witness is assured, so why not move all this data to the witness?

Why not employ at least two types of cells?

One type could store the file's metadata and link all the data; the other would store the actual data.

Why does every lock need to be <USER_DEFINED>?

Every element in the list (except possibly the root) does not seem to need this particular information as the inner cells are immutable, correct?

Why not directly specify an always-failure lock?

For now back to working on iCKB, I'll be waiting for your replies. In the meantime, I wish you a nice day :hugs:

Love & Peace, Phroi


Given the diversity of regulation in the world, we should assume that some nodes will prune some data that is committed to the chain.

Is there anywhere in CKBFS where this should be taken into account?


I think you may have misunderstood something about this design.
Back to the core concepts, there are a few points I want to mention:

  1. only use one single cell to index one file:
    CKB has a block size limitation, so the reason we upload a file in multiple parts is not that we want multipart for its own sake; it is that we do not want to be limited to ~500 KB.

  2. simple cell structure:
    Just like when I designed the Spore protocol, introducing extra complexity is always treated with extreme caution.
    The goal of CKBFS is NOT to reimplement existing filesystems, and it is NOT a port of a unix-style fs structure. It is for file STORAGE, just as the thread title says, and it does what fits the CKB model itself.

Simplicity is important to me, and also when distributing generic protocols or proposals. Taking in one more thing is simple, but we need to consider whether it is necessary and whether it provides anything unique and important.

Also, in the design I wrote something like <vec>CKBFS Cell. That does not mean we need split cells in order to publish one file. It just means you can publish multiple CKBFS cells at once without issue, and they will not influence each other.

Here I will respond to your questions, maybe not in the order of your original post, but I hope it gives you and other readers a clearer picture of this design.

So, why does the presented design use a Linked List instead of an inode Tree?

The first important thing is to keep the complexity under control.
As I mentioned previously, this is not a port of a unix-style filesystem, but I do want to point out that the layout, or the design philosophy, is similar (though not identical) to what you would usually see in a real-world inode. Their core abstractions are similar.

Why does every lock need to be <USER_DEFINED>?

As you can see in many of CKB's official RFCs, and also in the Spore Protocol's RFC, this appears everywhere. <USER_DEFINED> means we do not care which lock is used. It can be a simple secp lock, an omnilock, anyone-can-pay, or a custom lock script. That is why we have two scripts: by composing different lock scripts with type scripts, we can implement many different features and reuse the generic parts we want.

For example, I can put an "anyone can pay" lock on this CKBFS cell, and it simply becomes a "co-creation" work, just like in the real world you can share a notebook with others so they continue your story in it. Remember, this is just an example to help you understand why we can, and should, allow users to use different locks. Putting "always fail" on it means the file cannot be modified once published, which may also be a valid use if you want that feature.

Why does every element have to contain content_type and filename?

Just to make sure we do not lose track of the design: one cell is for one file. We are not composing multiple cells. content_type and filename are hints for "users" that may reference and fetch the file, for example dapps and indexers.

Why put the full backlinks: Vec<BackLink>?

If we do not want to lose track of the full contents, I think this is necessary. The checksum is just a verification; you will surely want it if you are retrieving files from a remote source.

Generally, why put all this information in cell data?

The core design is this: the actual data goes to witnesses, and metadata goes to cell data, since we want the cell to be the key by which we reference the file. A living cell means a file exists. It also keeps people from falling into unexpected data-correction hell. And it ensures that people cannot do scammy things, since the type script verifies the cell every time we operate on it on chain.


The last part might be a bit longer and more verbose, so I put a dividing line before it.

Could you detail more the Branch Forking File section?

For example, i have a text file:

HELLO CKBFS
HELLO CKB

The Adler-32 checksum of this content is: 1014105452
I published it as CKBFS Cell#1:

Data:
  content_type: "plain/text"
  filename: "hello_ckbfs.txt"
  index: 0 # just assume we put it at Witnesses[0] while publishing
  checksum: 1014105452
  backlinks: []

Now I want to publish another file that appends content to the tail of CKBFS Cell#1, so that the content becomes:

HELLO CKBFS
HELLO CKB
HELLO BTC

I will then get a different cell, CKBFS Cell#2. The publish transaction looks like this:

CellDeps:
  <...>
  CKBFS_CELL #1:
    Data:
      content-type: "plain/text"
      filename: "hello_ckbfs.txt"
      index: 0x0
      checksum: 1014105452
      backlinks: []
    Type:
      code_hash: CKBFS_CODE_HASH
      args: TYPE_ID_A
    Lock:
      <USER_DEFINED>
  <...>
Witnesses:
  <CKBFS, 0x00, "HELLO BTC" >
  <...>
Inputs:
  <...>
Outputs:
  <...>
  CKBFS_CELL
    Data:
      content-type: "plain/text"
      filename: "hello_ckbfs_btc.txt"
      index: 0x0
      checksum: 2136344547
      backlinks: 
        BackLink: 
            tx_hash:  TX_HASH_1 #`CKBFS#1`'s Outpoint.tx_hash
            index: 0x0 # keep it as same as #1
    Type:
      code_hash: CKBFS_CODE_HASH
      args: TYPE_ID_B
  <...>

Notice that after this we have TWO different files (CKBFS cells), CKBFS Cell#1 and CKBFS Cell#2. They can belong to different locks (different users), and each owner can update (append to) their own cell without any conflict.
For example, Alice can update #1 into:

HELLO CKBFS
HELLO CKB
HELLO RGB++

And Bob can still update #2 into:

HELLO CKBFS
HELLO CKB
HELLO BTC
HELLO DOB

Here we get a "branch forking" feature, much like how we use git. Remember that the backlink uses Outpoint.tx.hash as its actual value; if it does not match, validation fails. Just use load_input_out_point(index, Source::CellDep).tx.hash() as you can see in the implementation.
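
The Alice and Bob branches above can be reproduced off-chain as a quick sanity check. The sketch below assumes the default Adler32 hasher and that each appended part starts with a newline; the `extend` helper is a hypothetical name. It shows that both branches resume from the same base checksum and that each incremental result equals a one-shot checksum of that branch's full content.

```python
import zlib


def extend(checksum: int, tail: bytes) -> int:
    """Resume Adler-32 from an earlier checksum and feed the appended bytes."""
    return zlib.adler32(tail, checksum)


base = b"HELLO CKBFS\nHELLO CKB"
base_checksum = zlib.adler32(base)  # checksum stored in CKBFS Cell#1

# Alice appends to Cell#1.
alice_tail = b"\nHELLO RGB++"
alice_checksum = extend(base_checksum, alice_tail)

# Bob forks into Cell#2, then appends once more.
bob_tails = [b"\nHELLO BTC", b"\nHELLO DOB"]
bob_checksum = base_checksum
for tail in bob_tails:
    bob_checksum = extend(bob_checksum, tail)

# Each branch can be verified against its own full content.
alice_ok = alice_checksum == zlib.adler32(base + alice_tail)
bob_ok = bob_checksum == zlib.adler32(base + b"".join(bob_tails))
```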

Since we cannot control how node runners treat historical data, especially through contract scripts alone, I think it is hard to say "what we can do with this protocol".

Just like how we encourage miners, we can do something like how PT (Private Torrent Trackers) works.

Usually a BT swarm loses the ability to download or seek files if no one is seeding (uploading) anymore. PT has many interesting rules to keep torrents alive: for example, it rewards people who contribute uploads, while downloads lower a person's share ratio so that their download speed gets limited, etc.

Just like when we launched CKB Node Probe, we want more people to start running full nodes to make the network stronger. It does need to consider how people would like to store historical committed data.


I don't think making cells impossible to destroy is a good idea. It will leave CKB unavailable in the long run.

It looks like a small amount of CKB though; from what I gather, the cell data is needed to make sense of the data in the witness.
