What We Talk about When We Talk about a Decentralized GitHub

Decentralizing a service is not only a technical issue but also a matter of forming a new “social contract” among all the parties concerned. We are designing a decentralized Git hosting and collaboration system. This article explains why we need it and how to build it.

Decentralization opens up the data end of a traditional service provider such as SourceForge, Google Code, GitHub, or GitLab. The data end is a realm that all companies cling to (see the value of data) but should have been given back to the users. In a decentralized world, users do not stop using all traditional service providers but become independent from service providers. Users control their data — not just code or pull requests, but also all repository and user relationships. Users have the freedom to retain everything and seamlessly switch from one service provider to another. As a last resort, a user can be his or her own service provider. Companies compete in terms of service quality for access, use, and storage of user data. They serve users, rather than lock in users.

My other article “Light on the Dark Side of Network Effects” explains more about the underlying motivation.

System Architecture

The front end of a service system is amenable to decentralization. For example, MyEtherWallet, an open-source front-end product, can be put fully under your own control. Your wallet is yours. However, the back end has traditionally been enclosed and controlled by a service provider (their code may be open-source, but their data is not). The GDPR emerged to protect users’ data rights. It is a great legal movement, but technically it is not enough: you are given the option to export your data, but where do you go from there?

We propose to abstract a data end out of the traditional back end, which is managed by individual users through neutral, open-access blockchain technologies. An independent data end gives control back to the users, breaks the data barrier between companies so that they can compete more fairly and effectively, and enables more innovations based on open data (BTW, see our new open AI data license, as co-authored with Heather Meeker).

For a decentralized Git service, we structure the system as below.

[Figure: Architecture of Decentralized GitHub]

Let’s take Gogs, an open-source Git service (similar to GitLab Community Edition but more lightweight), as an example to see how it fits into the decentralized ecosystem. A traditional service provider may deploy Gogs and store users’ data in a MySQL database managed by its administrator. Users’ permissions to access and manipulate their data are granted by the Gogs server, instead of the other way around. If the Gogs server is shut down or does anything evil, users can hardly protect themselves because their rights can be revoked at any time (unless they file a lawsuit).

We decentralize Gogs and map out the relationships in the right way. Users should rely on a permanent trustworthy public facility, such as the Ethereum blockchain, to manage their data. The management rules are coded in a predefined smart contract, which executes unstoppably in a tamper- and censorship-proof manner. The smart contract is an open agreement among all parties. Users always have the absolute, cryptographically-guaranteed right to authorize any changes to the data they own or manage, according to the smart contract.

Companies still play a major role in offering most of the traditional front-end and back-end services. In particular, only the hash values of most user data are recorded in the blockchain, since blockchain storage is very expensive. The data is retrievable by hash value and physically stored as files in IPFS, backed up by the service providers or the users’ personal computers. The data files are stored in an open format using JSON, but the content can be end-to-end encrypted for a private project. At the front end, a service provider has to restore the data to MySQL for Gogs, which acts as a working cache for the data end. From the users’ perspective, Gogs operates almost the same way as it did before, yet when data changes occur or accumulate to a certain amount, the corresponding data owners or managers can check a human-readable summary and sign off on the changes they agree to. If a service provider perishes or is guilty of misdeeds, users never lose control of their data. They just reject or correct any anomalies and switch to another service provider (possibly themselves).
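
The storage path described above can be sketched as a content-addressed store. This is only an illustration under stated assumptions, not the project's actual code: an in-memory dict stands in for IPFS, SHA-256 stands in for whatever hash function the real system uses, and the names `ContentStore`, `put`, and `get` are hypothetical.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize a record deterministically so equal data yields equal hashes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


class ContentStore:
    """In-memory stand-in (an assumption) for a content-addressed store such as IPFS."""

    def __init__(self) -> None:
        self._files = {}  # maps hex digest -> serialized JSON bytes

    def put(self, record: dict) -> str:
        data = canonical_bytes(record)
        digest = hashlib.sha256(data).hexdigest()
        self._files[digest] = data
        return digest

    def get(self, digest: str) -> dict:
        data = self._files[digest]
        # Re-hash on retrieval so that swapped or corrupted content is detected.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("content does not match its hash")
        return json.loads(data)


store = ContentStore()
issue_hash = store.put({"title": "Fix typo", "state": "open"})
restored = store.get(issue_hash)
```

Because the hash is derived from the content, any party holding a copy of the file can serve it, and the reader can verify it without trusting that party.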

Furthermore, the smart contract can be defined in a general enough form so that the users can choose between Gogs or GitLab or even GitHub to present and manage their data. The smart contract only models and controls the core ownership and authorization logic, while data files are convertible to different platforms.

Design Rationale

When we talk about design rationale, we are talking about trade-offs. If you did not find the description above to be concrete enough, we parse specific (technical) items in the smart contract below.

  • State sign-off vs. operation control. Conventional wisdom from databases and distributed systems leads people to think that, to be safe, every data operation should be guarded by the blockchain. BigchainDB follows that paradigm, but it is only suitable for system nodes. Applying it to human users is too much hassle. Just imagine having to take out your phone and tap the uPort app to approve a transaction every time you starred a project, left a comment, closed an issue… Also, check the average Ethereum transaction fees and calculate your monthly bill. Essentially, it would be a waste of resources: there is no need to broadcast every such operation to all of the tens of thousands of nodes in the blockchain network. Instead, imagine that you are the boss of a company. You wouldn’t bother monitoring every employee’s everyday work. You’d just sign off on a big deal when it was accomplished. Right, be the boss of your data. Now and then, check what changes have been made and sign off on the resulting state. As we only store a 32-byte hash value of the data state in the blockchain, doing so is very efficient.

Principle 1: The role of a smart contract is to let users authorize valid states of their data, rather than control every operation performed on the data.
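
To make the contrast concrete, here is a minimal sketch of state sign-off. The state layout, the helper names `apply_operations` and `state_digest`, and the use of SHA-256 are all assumptions made for illustration; the point is only that many everyday operations collapse into a single 32-byte value to authorize.

```python
import hashlib
import json


def state_digest(state: dict) -> bytes:
    """Return a 32-byte digest of the whole data state."""
    encoded = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).digest()


def apply_operations(state: dict, operations: list) -> dict:
    """Apply everyday operations off-chain; none of them touches the blockchain."""
    new_state = {key: list(values) for key, values in state.items()}
    for kind, value in operations:
        new_state.setdefault(kind, []).append(value)
    return new_state


state = {"stars": [], "comments": []}
ops = [("stars", "alice"), ("comments", "LGTM"), ("stars", "bob")]
state = apply_operations(state, ops)

# One sign-off covers all three operations: only this digest would go on-chain.
digest = state_digest(state)
print(len(ops), "operations ->", len(digest), "byte digest")
```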

  • Restriction vs. isolation. Users have the absolute power to control their data, but any absolute power is dangerous — some of the users may do evil. One example is to disorder the relations that Gogs restores to MySQL. A user may fake other entities’ IDs to gain “unjust enrichment”. To counteract such an attack, one approach is to map all objects onto the blockchain and enforce unique IDs in smart contracts. However, doing so conflicts with Principle 1, incurring the hassle and cost as discussed above. Besides, it would complicate the smart contract a lot. The more complex a smart contract is, the greater the chance that there will be a bug and that we will omit something that an adversary can compromise. In contrast, our approach is to extend the width of IDs in MySQL and append the owner’s unique Ethereum account address to each ID. As a result, a malicious user can fake IDs but can only affect his or her own data. That’s fair. Also, any patch in this approach applies to Gogs, instead of a deployed “immutable” smart contract.

Principle 2: A decentralized system should isolate multiple users in its working state, rather than complicate smart contracts by adding unnecessary restrictions on users.
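
The isolation idea can be sketched as follows. The `local-id` plus address format and the function names are hypothetical; the source only says that the owner's Ethereum account address is appended to each ID. The sketch shows that a user who claims someone else's local IDs still ends up inside their own namespace.

```python
import re

# An Ethereum account address: 20 bytes written as 40 hex digits after "0x".
ADDRESS_RE = re.compile(r"0x[0-9a-fA-F]{40}")


def namespaced_id(local_id: int, owner_address: str) -> str:
    """Build a widened ID: the local ID followed by the owner's address."""
    if ADDRESS_RE.fullmatch(owner_address) is None:
        raise ValueError("expected a 0x-prefixed, 40-hex-digit address")
    return f"{local_id}-{owner_address.lower()}"


def restore_rows(owner_address: str, claimed_ids: list) -> dict:
    """Restore one user's rows into the working cache, keyed by widened ID."""
    return {namespaced_id(i, owner_address): owner_address.lower() for i in claimed_ids}


alice = "0x" + "a" * 40
mallory = "0x" + "b" * 40

table = {}
table.update(restore_rows(alice, [1, 2]))
# Mallory claims Alice's local IDs, but the keys still land in Mallory's namespace.
table.update(restore_rows(mallory, [1, 2]))
```

Alice's rows survive untouched, and any fix to this scheme is a change to the service software rather than to a deployed contract.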

  • Prevention vs. tolerance. Following up on the above point, isolation is only practical if the system can tolerate incorrect user data. This is not asking for the moon. Current systems already do so with external user inputs. Now, the system architecture and integrity assumptions change — the data end is out of the traditional back end and becomes an external component for the front end. An input check can be as simple as testing the validity of a name, but conducting a check is not as simple as it sounds if implemented on Ethereum (see this code). Also, it is questionable as to whether such ad-hoc policies should be hard-coded into a permanent smart contract. Our opinion is that the smart contract need not necessarily prevent users from doing all manner of stupid things and that it is Gogs that should skim those off. Again, we want a minimalist smart contract with minimal execution cost and zero vulnerability. Meanwhile, the smart contract must take on the responsibilities which individual service instances cannot, e.g., ensuring the global uniqueness of a user name.

Principle 3: A smart contract should perform only those checks that individual service instances cannot perform.
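
A rough sketch of this division of labor is shown below. The name-format rule and the `NameRegistry` class are invented for illustration: the former stands for an ad-hoc policy that a service instance such as Gogs can enforce and change freely, the latter for the single global check that only a shared registry can make.

```python
import re

# Hypothetical format policy: lowercase letters, digits, inner hyphens, max 39 chars.
NAME_RE = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?")


def service_accepts(name: str) -> bool:
    """Format policy enforced by the service instance, not by the contract."""
    return NAME_RE.fullmatch(name) is not None


class NameRegistry:
    """Toy stand-in for the on-chain registry: it enforces uniqueness and nothing else."""

    def __init__(self) -> None:
        self._owners = {}

    def register(self, name: str, owner: str) -> bool:
        if name in self._owners:
            return False  # already taken, whoever asks
        self._owners[name] = owner
        return True


registry = NameRegistry()
first = registry.register("alice", "0x01")
second = registry.register("alice", "0x02")
```

Here the registry would happily record a badly formatted name; it is the service that filters such input before presenting or using it.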

  • Concurrency control vs. diff and merge. In “peacetime”, we suggest that a team of developers settle on a default service provider, as centralized management is always more efficient (though you should not relinquish your rights). However, when you switch between service providers, inconsistencies may emerge, e.g., different pull requests showing up on two websites. Existing distributed systems already offer many solutions, but they hardly apply across organizations. In a decentralized world, the smart contract can act as the coordinator and help prevent multiple service instances from overwriting each other’s conflicting changes (in accordance with Principle 3). We require any hash value assignment to contain the previous hash value it sees, which the smart contract checks against the current record on the blockchain. If there is any mismatch, the transaction is rejected, and the user will have to pull the latest data files, then diff and merge.

Principle 4: A smart contract should verify any data state transition so that concurrent transactions do not silently overwrite conflicting states. Users are responsible for resolving any state merge conflict.
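
The check described above is essentially a compare-and-swap on the recorded hash. The following sketch models it in plain Python; `StateRecord` and `StaleStateError` are hypothetical names, and a real contract would of course run on the blockchain rather than in a local object.

```python
class StaleStateError(Exception):
    """Raised when an update was built on an outdated state hash."""


class StateRecord:
    """Toy model of the on-chain record holding one current state hash."""

    def __init__(self, initial_hash: str) -> None:
        self.current_hash = initial_hash

    def update(self, previous_hash: str, new_hash: str) -> None:
        # Reject the write unless the caller saw the latest recorded state.
        if previous_hash != self.current_hash:
            raise StaleStateError("pull the latest data files, then diff and merge")
        self.current_hash = new_hash


record = StateRecord("h0")
record.update(previous_hash="h0", new_hash="h1")  # provider A: accepted

try:
    record.update(previous_hash="h0", new_hash="h2")  # provider B: stale view
except StaleStateError:
    rejected = True
else:
    rejected = False
```

The second update is refused rather than merged: the record only detects the conflict, and resolving it stays with the user.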

  • Being permanent vs. being forgotten. Those objectionable pictures will never, ever be erased from Ethereum. But what about a stupid comment you regretfully wrote on an issue? Fortunately, our architecture design supports a reasonable capability to forget data. Although the data hashes are stored on the blockchain, if all the original copies are gone, no one will be able to retrieve the data. A service provider should typically promise to keep a number of historical versions of your data, as Dropbox does. But if you request that all versions be removed, the service should execute your request, especially if it complies with GDPR. Of course, it is possible that someone else has already duplicated your file and stored it somewhere else, which is beyond your service provider’s responsibility, just as someone may have cloned your open-source code before you decided to remove it from GitHub. We are working with Orrick to draft proper Terms of Use and Privacy Policy for service providers in the ecosystem and will “open source” them in this repo for free public use.

Next Steps

In addition to security concerns, another consideration is to avoid triggering a competition for names, as happens in domain registration. User and organization names are unique and permanent assets on the blockchain. So, we plan the following actions to avoid irrelevant competition or speculation.

  1. We will reserve a number of established user or organization names, such as “ethereum”, “ipfs”, and “microsoft”. Once contacted by the real owners, we will transfer the ownership to them (for free, except for transaction fees).

  2. We respect the convention resulting from the wide use of GitHub. We will offer GitHub users an opportunity to claim their existing names.