# Can we teach AI how to code? Welcome to IBM’s Project CodeNet

IBM’s AI research division has released a 14-million-sample dataset for developing machine learning models that can help with programming tasks. Called Project CodeNet, the dataset takes its name from ImageNet, the famous repository of labeled photos that triggered a revolution in computer vision and deep learning.

While there’s a scant chance that machine learning models built on the CodeNet dataset will make human programmers redundant, there’s reason to be hopeful that they will make developers more productive.

Automating programming with deep learning

In the early 2010s, impressive advances in machine learning triggered excitement (and fear) that artificial intelligence would soon automate many tasks, including programming. But AI’s penetration into software development has been extremely limited.

Human programmers discover new problems and explore different solutions using a plethora of conscious and subconscious thinking mechanisms. In contrast, most machine learning algorithms require well-defined problems and a lot of annotated data to develop models that can solve the same problems.

There have been many efforts to create datasets and benchmarks to develop and evaluate “AI for code” systems. But given the creative and open nature of software development, it’s very hard to create the perfect dataset for programming.

The CodeNet dataset

With Project CodeNet, the researchers at IBM have tried to create a multi-purpose dataset that can be used to train machine learning models for various tasks. CodeNet’s creators describe it as a “very large scale, diverse, and high-quality dataset to accelerate the algorithmic advances in AI for Code.”