One of the most valuable language generation tasks at which LLMs have shown strong ability is code synthesis (or program synthesis): generating formal language, particularly computer programs, that satisfies a set of constraints. It is therefore not surprising that multiple tools have appeared to perform that task, some of the most notable being GitHub Copilot, Ghostwriter, CodeWhisperer, Tabnine, and Duet AI. These tools promise software engineers a significant (often high double-digit) overall productivity improvement.
While future models may be able to fully (or almost fully) automate the work currently performed by software engineers, current models cannot, and so they fill the same role as other tools like modern IDEs, time-traveling debuggers, and optimizing compilers. As with all previous tools, software engineers will need to learn how to best use Generative AI coding tools in order to achieve the promised productivity gains. To do so, they must first be aware of what the tools can and cannot do, i.e., their capabilities and limitations.
The most obvious measure of the above is the quality of the code generated by an LLM, which is usually evaluated by calculating the model’s pass@k (i.e., given k generated solutions to a problem, pass@k is 1 when at least one solution passes all test cases, and 0 otherwise). Many benchmarks are used to evaluate pass@k (e.g., APPS, HumanEval, and MBPP). Most of them are built from a variety of coding problems, each of which consists of a textual specification of the problem and a set of test cases that a correct solution must pass.
These benchmarks are a good initial tool for evaluating how well an LLM performs code synthesis. However, the day-to-day work of software engineers covers a much greater variety of tasks, most of them unlike the problems in these datasets. The benchmarks are therefore a very coarse evaluation tool, particularly when it comes to understanding the limitations that these tools currently have.
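The pass@k definition above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code of any particular benchmark: the function names are hypothetical, and the second function uses the unbiased estimator commonly reported alongside HumanEval (drawing n samples of which c are correct), which is an addition not discussed in this article.

```python
from math import comb


def pass_at_k_indicator(results: list[bool]) -> int:
    """Per-problem pass@k as defined in the text.

    `results` holds one boolean per sampled solution (k of them):
    True if that solution passed every test case.
    """
    return int(any(results))


def benchmark_pass_at_k(per_problem: list[list[bool]]) -> float:
    """Average the per-problem indicator over all problems in a benchmark."""
    if not per_problem:
        raise ValueError("benchmark has no problems")
    return sum(pass_at_k_indicator(r) for r in per_problem) / len(per_problem)


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct solution.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `benchmark_pass_at_k([[False, True], [False, False]])` scores one solved problem out of two. Note that a high score only says the test cases passed; it says nothing about readability, maintainability, or fit with an existing codebase.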
Measuring Capabilities and Limitations
Since the ultimate purpose of any tool software engineers use is to aid their work, measuring the overall productivity impact of these tools is paramount. What is the best way to measure the capabilities and limitations of tools that use generative models for code synthesis beyond pass@k? A holistic approach that looks at the impact on overall developer productivity appears to be the best starting point. Only with that overall measure in hand can we drill down, identify individual capabilities and limitations, and learn when and how best to use these tools.
Impact on Overall Developer Productivity
A previous article describes some of what we have learned at Encora about the productivity impact of GitHub Copilot. There we discuss using the SPACE framework to measure the change in developer productivity, as A. Ziegler et al. did in their paper “Productivity Assessment of Neural Code Completion,” and how we confirmed that current code synthesis tools like GitHub Copilot provide an improvement of around +80% in overall perceived productivity. The team also made the following two main findings about the gains in overall productivity:
- We confirmed the finding of A. Ziegler et al. that the best metric to measure overall perceived productivity is the acceptance rate of shown suggestions. Furthermore, we confirmed that this rate drops quickly as a suggestion needs more modifications before it can be accepted.
- Impact is uneven across developer experience levels, with less experienced developers seeing the biggest improvements. This does not only apply to the years of experience a developer has in total, but also to years of experience/familiarity with a particular stack.
Let’s break down each one of those and see what they tell us about what these tools can (and cannot) do, and how we can make the best use of them.
The Acceptance Rate is defined as the fraction of completions shown that are included in the source code. Two measures for acceptance rate are Click Through Rate (CTR) and Daily Completions accepted Per User (DCPU); DCPU is normalized by the time spent coding each day. For reference, A. Ziegler et al. saw an acceptance rate of 27% and a mean DCPU in excess of 31.
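As a rough illustration of these two measures, the sketch below computes an acceptance rate and a mean count of accepted completions per user per day from a log of completion events. The `CompletionEvent` record and both function names are hypothetical, not part of any tool's telemetry API. The time-spent-coding normalization is left out because the text does not specify how it is applied.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CompletionEvent:
    """One completion shown to a user (hypothetical log record)."""
    user: str
    day: date
    accepted: bool


def acceptance_rate(events: list[CompletionEvent]) -> float:
    """Fraction of shown completions that were accepted."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


def mean_daily_accepted_per_user(events: list[CompletionEvent]) -> float:
    """Mean number of accepted completions per (user, day) pair.

    Only days on which a user was shown at least one completion count.
    No normalization by time spent coding is applied in this sketch.
    """
    accepted_by_user_day: dict[tuple[str, date], int] = defaultdict(int)
    for e in events:
        accepted_by_user_day[(e.user, e.day)] += int(e.accepted)
    if not accepted_by_user_day:
        return 0.0
    return sum(accepted_by_user_day.values()) / len(accepted_by_user_day)
```

Keeping the two measures separate matters: a user who sees few suggestions but accepts most of them has a high acceptance rate and a low daily count, and the reverse is also possible.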
What drives the acceptance rate? A. Ziegler et al. found two main drivers, and we found a third one:
- The programming language being used.
- Circadian and weekly rhythms.
- Current repository size.
Unsurprisingly, languages that are better represented in the training dataset produce completions with a higher acceptance rate. Using these tools with languages that are not significantly represented in the training dataset will prove difficult; the option then is to fine-tune a model like StarCoder. The CodeCompose team saw the following lifts from fine-tuning to increase performance for Hack (Meta’s modified version of PHP) and Flow:
Table 1: Accuracy metrics across programming languages for the public model and the fine-tuned model from “CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring,” by V. Murali et al.
As can be seen (most clearly with Hack), there is significant evidence that a model fine-tuned with internal data will outperform an off-the-shelf model that is trained only on external data, particularly for underrepresented languages.
A. Ziegler et al. also observed “strong regular patterns in overall acceptance rate” depending on the time of day and day of the week as can be seen in the following graph:
Figure 6: Average acceptance rate for hour-long time buckets during the week, from "Productivity Assessment of Neural Code Completion" by A. Ziegler et al. Each point represents the average for such a bucket, whereas the shaded ribbon represents the min-max variation for single hours during the observed 4-week period.
We clearly see three time regimes that differ to a statistically significant degree:
- The weekend (Saturdays and Sundays) with an average acceptance rate of 23.5%.
- Typical non-working hours during the week (i.e., evenings) with an average acceptance rate of 23%.
- Typical working hours during the week with an average acceptance rate of 21.2%.
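To reproduce this kind of breakdown on your own telemetry, each completion event can be assigned to one of the three regimes and the acceptance rate computed per regime. The sketch below is only an illustration: the function names are hypothetical, and the working-hours boundaries (07:00 to 16:00 local time) are assumed defaults that should be adjusted to your team's schedule and time zone.

```python
from collections import defaultdict
from datetime import datetime


def time_regime(ts: datetime, work_start: int = 7, work_end: int = 16) -> str:
    """Classify a timestamp into one of the three regimes.

    work_start and work_end are assumed hour boundaries, not values
    taken from this article.
    """
    if ts.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    if work_start <= ts.hour < work_end:
        return "working_hours"
    return "non_working_hours"


def acceptance_rate_by_regime(
    events: list[tuple[datetime, bool]],
) -> dict[str, float]:
    """Acceptance rate per regime from (timestamp, accepted) pairs."""
    shown: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for ts, was_accepted in events:
        regime = time_regime(ts)
        shown[regime] += 1
        accepted[regime] += int(was_accepted)
    # Only regimes with at least one shown completion appear in the result.
    return {regime: accepted[regime] / shown[regime] for regime in shown}
```

A breakdown like this shows that rates differ between regimes but not why; separating behavioral effects from changes in who is coding requires per-user analysis.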
What drives the higher acceptance rate outside of standard working hours? A. Ziegler et al. considered both changes in the users’ behavior (e.g., users accept more suggestions because they are less focused), and changes in who is coding and what they are working on (e.g., personal projects as compared to company work).
To distinguish between the two, they “trained a model to predict a user’s acceptance rate for a particular time bucket from their usual contribution times (Table 4).” They found that the value of the time bucket didn’t matter, only whether it lay in the user’s usual time regime, and that users had lower acceptance rates for times outside their usual ones.
Finally, during our first project, we observed that the acceptance rate is affected by the size of the repository, though this might have been an artifact of it being a greenfield development with no previous code for the tool to leverage.
Developer Experience Levels
At Encora we have observed that less experienced developers (be that in total years of experience or in total experience on the specific stack being used) see bigger overall productivity improvements. This is consistent with the effects on productivity seen by S. Noy and W. Zhang in their working paper “Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,” where they observed that:
“Inequality between workers decreases, as ChatGPT compresses the productivity distribution by benefiting low-ability workers more. ChatGPT mostly substitutes for worker effort rather than complementing worker skills, and restructures tasks towards idea-generation and editing and away from rough-drafting.”
The last part is interesting because it means that experienced developers can apply their skills more readily when using technology that is new to them but that the tools have been trained on. It also encourages more junior developers to think more holistically about what they are working on, which fosters systems thinking skills and good design habits. For this to happen, however, there needs to be continuous guidance from more experienced developers. From this, we drew the following three conclusions:
- Junior developers paired up with more senior developers make better use of code synthesis tools than junior developers do on their own. Not only is their current productivity increased, but their overall productivity improvements from learning are accelerated as well.
- More experienced developers will find that the tools are a great aid when learning (and using) new languages and frameworks, but they will not be as helpful with languages and frameworks they have already mastered.
- Systems thinking and system design skills become more important to overall productivity when using these tools since a bigger share of time is spent on tasks that rely on their use.
While S. Noy and W. Zhang explored the productivity impacts of ChatGPT on writing tasks, our experience of the effects of using LLM-powered tools matches theirs despite the difference in both tasks and tools being used.
As we have seen, getting the most out of code synthesis tools like GitHub Copilot requires an understanding of their current capabilities and limitations. Current tools perform better on dynamically typed languages, popular languages, and bigger repositories. Current tools also provide a bigger boost to less experienced developers and make a pairing of junior engineers and more senior ones more effective in the long term. They also increase the importance of systems thinking and system design skills.