IBM Releases CodeNet Dataset For AI Coding

IBM has launched Project CodeNet, a dataset aimed at teaching AI to translate code from one programming language to another. The dataset consists of 14 million code samples, made up of around 500 million lines of code in 55 programming languages, ranging from C++, Java, Python, and Go to Cobol, Pascal, and Fortran.

IBM Research says Project CodeNet can be used to train machine learning models to translate code. The code samples were taken from submissions to open programming competitions, and IBM says that over 90 percent of the code samples come with a description of what the code does, including a concise problem statement and a specification of the input and output formats.


The developers say that for over half of the coding problems they have also obtained sample input and output from the problem description, which they say is key to determining the equivalence of two code samples in different languages, and which can drive reinforcement learning techniques for code translation. The samples also include metadata such as the code size, memory footprint, CPU run time, and status, which indicates acceptance or error types.
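To make the role of sample input and output concrete, here is a minimal Python sketch of how two submissions might be checked against one sample case. It is not part of Project CodeNet: the helper names `run_program`, `normalize`, and `agree_on_sample` are hypothetical, and the two "submissions" are stand-in Python one-liners so the example runs anywhere. Agreeing on a sample is evidence of equivalence, not proof.

```python
import subprocess
import sys


def run_program(cmd, sample_input, timeout=5.0):
    """Run a command, feed it sample_input on stdin, and return its stdout."""
    result = subprocess.run(
        cmd,
        input=sample_input,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the program exits with an error status
    )
    return result.stdout


def normalize(output):
    """Drop trailing whitespace and surrounding blank lines before comparing."""
    return [line.rstrip() for line in output.strip().splitlines()]


def agree_on_sample(cmd_a, cmd_b, sample_input, expected_output):
    """Return True if both programs reproduce the expected sample output."""
    expected = normalize(expected_output)
    out_a = normalize(run_program(cmd_a, sample_input))
    out_b = normalize(run_program(cmd_b, sample_input))
    return out_a == expected and out_b == expected


if __name__ == "__main__":
    # Hypothetical stand-ins for two solutions to "sum the integers on stdin".
    # In practice these would be commands for programs in different languages.
    sum_loop = [
        sys.executable,
        "-c",
        "import sys\nt = 0\nfor x in sys.stdin.read().split(): t += int(x)\nprint(t)",
    ]
    sum_builtin = [
        sys.executable,
        "-c",
        "import sys; print(sum(map(int, sys.stdin.read().split())))",
    ]
    print(agree_on_sample(sum_loop, sum_builtin, "1 2 3\n", "6\n"))
```

A translated program that passes the same sample cases as its source gives a learning system a checkable signal, which is presumably why IBM ties this data to reinforcement learning.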

The IBM team estimates that automated rule-based systems can succeed in translating somewhere between 50 and 60 percent of a program into another programming language, leaving the rest to be translated manually, involving complex rules.

The hope is that Project CodeNet will be able to “drive algorithmic innovation” to extract the more complex code using sequence-to-sequence models, in a similar way to how translators for human languages now work. The aim is to make a more significant dent in machine understanding of code, as opposed to machine processing of code.

The project includes tools to convert code samples into a representation that can be consumed by AI algorithms, including a tokenizer that generates a stream of tokens, a parser that generates a Simplified Parse Tree (SPT) for each recognized program, and a code analysis tool that creates control and data flow graphs. Project CodeNet is available on GitHub.
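As an illustration of what "a stream of tokens" means, the following sketch uses Python's standard `tokenize` module on a Python snippet. This is not the CodeNet tokenizer, whose interface is not described here; the function name `token_stream` is hypothetical and the choice to drop layout tokens is an assumption made for readability.

```python
import io
import tokenize

# Layout-only tokens that carry no lexical content of their own.
_SKIP = {
    tokenize.NL,
    tokenize.NEWLINE,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def token_stream(source):
    """Yield (token_kind, text) pairs for a string of Python source code."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type in _SKIP:
            continue
        # exact_type distinguishes individual operators such as PLUS and EQUAL.
        yield tokenize.tok_name[tok.exact_type], tok.string


if __name__ == "__main__":
    for kind, text in token_stream("total = a + 1\n"):
        print(kind, text)
```

A flat sequence like this is the kind of input a sequence-to-sequence model consumes, while parse trees and flow graphs expose structure that a token stream leaves implicit.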


More Information

Project CodeNet On GitHub

Related Articles

IBM’s Elyra AI Toolkit

IBM Debater Argues Like A Human – But How?

New MIT–IBM Watson AI Lab

A New Impetus For IBM Watson


