Open data infrastructure for AI – who is developing it and why

An independent consortium of American organizations is working on establishing a network of open data repositories. These repositories will serve as a foundation for training machine learning models across various industries, including medicine and climate research. Although the project is still in its early stages, it is worth discussing why such infrastructure is needed.

Data inconsistency

Language models have been a topic of widespread discussion in the media lately, with even Elon Musk initially calling for a temporary halt to neural network development before changing his mind and founding a company to train ML models. These models, such as ChatGPT, are already being utilized in business intelligence systems to assist company leaders in making strategic decisions.

However, there is a potential issue with training large language models on internet-sourced data: that data may contain inaccuracies or bugs, particularly when it comes to program code. For instance, OpenAI had to take ChatGPT offline in late March after a bug in an open-source component caused it to expose other users’ conversation histories.

Experts contend that training language models on such data may result in them generating suboptimal programs. Moreover, as one ETH Zurich professor notes, the responses of language models can be influenced by malicious injections into the training samples. In theory, curated datasets for training AI systems could solve this problem.
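What curation could mean in practice is easiest to show with a toy example. The Python sketch below (the source labels and checks are hypothetical, not part of any real pipeline) drops code samples that come from untrusted sources or do not even parse:

    import ast

    # Hypothetical labels marking where a sample was collected from.
    TRUSTED_SOURCES = {"curated-repo", "reviewed-docs"}

    def is_valid_python(code: str) -> bool:
        """Cheap sanity check: reject code samples that do not even parse."""
        try:
            ast.parse(code)
            return True
        except SyntaxError:
            return False

    def curate(samples: list[dict]) -> list[dict]:
        """Keep only syntactically valid samples from trusted sources."""
        return [
            s for s in samples
            if s["source"] in TRUSTED_SOURCES and is_valid_python(s["text"])
        ]

    raw = [
        {"source": "curated-repo", "text": "def add(a, b):\n    return a + b\n"},
        {"source": "random-forum", "text": "def add(a, b) return a + b"},  # malformed
    ]
    print(len(curate(raw)))  # -> 1: the malformed, untrusted sample is dropped

Real curation would of course go much further (provenance tracking, human review, deduplication), but even simple filters like these remove a lot of obvious noise.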

The path to open data

NASA, along with other American organizations such as the National Science Foundation and National Institutes of Health, is funding a project to establish the Open Knowledge Network (OKN). The OKN will be an open collection of repositories containing data and related knowledge graphs. It will essentially function as a cloud-based infrastructure for developing machine learning models across industries such as healthcare, law enforcement, space, and natural sciences.
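To make the idea concrete, a knowledge graph can be pictured as a set of subject-predicate-object triples in the spirit of RDF. The minimal Python sketch below uses invented medical entities and relations, not the actual OKN schema:

    # A knowledge graph as a set of (subject, predicate, object) triples.
    triples = {
        ("aspirin", "treats", "headache"),
        ("aspirin", "is_a", "NSAID"),
        ("ibuprofen", "is_a", "NSAID"),
    }

    def objects(subject: str, predicate: str) -> set[str]:
        """Return every object linked to `subject` through `predicate`."""
        return {o for s, p, o in triples if s == subject and p == predicate}

    print(objects("aspirin", "treats"))  # -> {'headache'}
    print(objects("aspirin", "is_a"))    # -> {'NSAID'}

Production systems store such triples in dedicated graph databases and query them with languages like SPARQL, but the underlying data model is the same.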

Currently, the organizations are seeking contractors for each of the project’s three development stages. The first stage will focus on building knowledge graphs for domain-specific problems; the second will involve developing and deploying an infrastructure for exchanging data; and the third will center on creating training materials and tools for interacting with the OKN.

Private initiatives

Alongside initiatives aimed at building intelligent technologies and integrating them into the internet, projects related to user identification are being developed. As it becomes harder to distinguish human-generated content from machine-generated content, enthusiasts are proposing protocols for verifying the “humanity” of participants in network communications.

One such proposal is PeerID, created by a Hacker News user. Personal identification involves a physical meeting of two participants, who record a special p2p signature in a distributed ledger without exchanging passports or other personal data. A service called an “oracle” verifies the entries in the ledger and computes an individual trust level for each user, which grows with the number of completed “physical” verifications. The oracle then generates a zero-knowledge proof that the client application can use as an identifier.
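Since PeerID is only a concept, any implementation details are speculative. The Python sketch below captures the general shape of the idea under those assumptions: each meeting becomes a joint attestation in a shared ledger, and the oracle derives a trust score from the number of distinct verified peers (the zero-knowledge proof step is omitted):

    import hashlib

    # A stand-in for the distributed ledger: (user_a, user_b, attestation hash).
    ledger: list[tuple[str, str, str]] = []

    def attest_meeting(user_a: str, user_b: str, nonce: str) -> None:
        """Both parties sign off on a physical meeting; store a joint hash."""
        digest = hashlib.sha256(f"{user_a}|{user_b}|{nonce}".encode()).hexdigest()
        ledger.append((user_a, user_b, digest))

    def trust_level(user: str) -> int:
        """Oracle side: trust grows with the number of distinct peers met."""
        peers = {b for a, b, _ in ledger if a == user}
        peers |= {a for a, b, _ in ledger if b == user}
        return len(peers)

    attest_meeting("alice", "bob", "2024-05-01T10:00")
    attest_meeting("alice", "carol", "2024-05-02T12:30")
    print(trust_level("alice"))  # -> 2

Per the description above, the real oracle would issue a zero-knowledge proof derived from this score rather than the score itself, letting a client prove its status without revealing the underlying meetings.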

At present, the project is only a concept, and its future direction is unclear. However, it is likely that new mechanisms will emerge to help people distinguish themselves from machines and bots.

Conclusion

In conclusion, the development of open data repositories and knowledge graphs is an important step toward training more accurate and reliable machine learning models. However, the quality of the data used to train these models remains a concern, as the internet can contain errors or maliciously injected samples that influence the responses of language models. The solution may lie in the development of curated datasets for AI training.

Furthermore, as machine-generated content becomes increasingly common on the internet, protocols for verifying the “humanity” of participants in network communications are being proposed. PeerID is one such protocol: it relies on personal identification through physical meetings recorded in a distributed ledger, which an oracle verifies to compute an individual trust level. Although the project is still in its early stages and its future is uncertain, more mechanisms for distinguishing humans from machines in online interactions are likely to emerge.