Custom Models for GitHub Copilot
Custom models for GitHub Copilot are now available in public beta, allowing organizations to fine-tune Copilot so that it better understands and aligns with their unique coding practices.
This new capability improves the relevance and accuracy of code suggestions in your projects, so we’ve compiled the key points to keep in mind and how to get the most out of it right away.
What are Custom Models?
Custom models are large language models (LLMs) fine-tuned on an organization’s code bases. By training the model on proprietary libraries, specialized languages, and internal coding patterns, Copilot can provide code suggestions that are more context-aware and better suited to each organization’s needs.
You can now create a custom model using your own GitHub repositories, and you can also opt in to collecting code snippets and telemetry from your developers’ Copilot prompts and responses to further fine-tune the model.
This aligns Copilot’s suggestions with your organization’s coding practices, making them more relevant and accurate. That translates into less time spent on code review, debugging, and manual code adjustments, and therefore into higher productivity and better code quality.
When to use custom models?
As with any tool, custom models fit some situations better than others. Consider using them in the following scenarios:
- Improve library and API usage: A model can prioritize custom libraries and APIs for its suggestions, making it easier to follow internal standards.
- Improve support for specialized languages: fine-tuning helps Copilot better understand less common or proprietary languages, reducing friction and improving productivity.
- Adapt to evolving code bases: By periodically retraining the model on your code base, you ensure that Copilot stays up to date with the latest coding patterns and continues to provide relevant and accurate suggestions.
Create a Custom Model in GitHub Copilot
Because the feature is still in beta, only one organization within an enterprise can create a custom model.
As an owner of that organization, you choose which repositories are used to train the model. The model can be trained on one, several, or all of the organization’s repositories, and it is trained on the content of the default branches of the selected repositories.
The custom model is used to generate code completion suggestions for all file types, regardless of whether a given file type was used for training. You can also choose whether telemetry data is used for training.
Once started, the creation of a custom model will take several hours to complete. When the process is complete, you will be notified by email, and if it fails, Copilot will continue to use the current model to generate code completion suggestions.
Once the model has been created successfully, all managed users in the enterprise who get their Copilot Enterprise access from the organization where the model was deployed will begin to see code completion suggestions generated by the custom model.
To test the effectiveness of the model, it is advisable to evaluate the usage and satisfaction levels of GitHub Copilot code completion suggestions before and after model deployment. This can be done by using REST APIs or by surveying developers on their perception and satisfaction with the model suggestions.
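As a minimal sketch of the before/after comparison, the snippet below computes the acceptance rate of code completion suggestions on either side of a deployment date. The record fields (`day`, `suggestions`, `acceptances`), the function names, and the sample numbers are all hypothetical; in practice you would map them from whatever usage data the REST API returns for your organization.

```python
from datetime import date


def acceptance_rate(records):
    """Return accepted / suggested across the given daily records (0.0 if none)."""
    suggested = sum(r["suggestions"] for r in records)
    accepted = sum(r["acceptances"] for r in records)
    return accepted / suggested if suggested else 0.0


def compare_before_after(records, deployed_on):
    """Split daily records at the deployment date and compare acceptance rates."""
    before = [r for r in records if r["day"] < deployed_on]
    after = [r for r in records if r["day"] >= deployed_on]
    return acceptance_rate(before), acceptance_rate(after)


# Hypothetical daily usage data, for illustration only.
sample = [
    {"day": date(2024, 10, 1), "suggestions": 200, "acceptances": 50},
    {"day": date(2024, 10, 2), "suggestions": 300, "acceptances": 75},
    {"day": date(2024, 10, 8), "suggestions": 250, "acceptances": 100},
    {"day": date(2024, 10, 9), "suggestions": 150, "acceptances": 60},
]

before, after = compare_before_after(sample, deployed_on=date(2024, 10, 8))
print(f"before: {before:.2f}, after: {after:.2f}")  # before: 0.25, after: 0.40
```

Acceptance rate is only one possible signal; combining it with developer surveys gives a fuller picture than either measure alone.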
Implementing the custom model
Here are the steps to follow to set up a custom model:
- In the upper right corner of GitHub, open your profile menu, select “Your organizations”, and then click “Settings” next to the organization.
- In the left sidebar, click on “Copilot” and then “Custom Model”.
- On the “Custom Models” page, click “Train a new custom model”. Then, under “Select repositories”, choose either all repositories or selected repositories.
- If you choose selected repositories, select the ones you want to use for training and then click on “Apply”.
- Optionally, if you prefer to train your model only with code written in certain programming languages, go to “Specify languages” and type the name of a language you want to include. Select the one you want from the list displayed and repeat the process for each language you want to include.
- Click on “Create new custom model”.
*Extra: To improve the performance of the model, select the check box labeled “Include prompt and hint data”. This allows Copilot to collect data from the prompts users submit and the code completion suggestions generated in response. Once sufficient data has been collected, Copilot will use it as part of the model training process, allowing it to produce a more effective model.
Aspects to be considered
You can check the progress of the model creation by clicking the “Training details” button. Keep in mind that training may fail for several reasons, such as:
- There is insufficient or unrepresentative data, which makes the fine-tuning unstable.
- If the data are not sufficiently different from the public data on which the base model was trained, the training may fail or the quality of the code completion suggestions of the custom model may be only marginally improved.
- A data preprocessing step may encounter unexpected file types and formats that cause an error. In that case, a possible fix is to specify only certain file types for training.
You can also retrain or delete the custom model from the organization’s settings page. When you retrain the model, it is updated to include any new code that has been added to the repositories selected for training. You can retrain it once a week.
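Because retraining is limited to once a week, a small helper can tell you when the next retraining becomes possible. This is only a sketch: it assumes that "once a week" means seven days after the previous training run, and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: "once a week" is interpreted as at least 7 days between runs.
RETRAIN_INTERVAL = timedelta(days=7)


def next_retrain_time(last_trained_at):
    """Return the earliest moment a new training run could be started."""
    return last_trained_at + RETRAIN_INTERVAL


def can_retrain(last_trained_at, now):
    """Return True if enough time has passed since the last training run."""
    return now >= next_retrain_time(last_trained_at)


last_run = datetime(2024, 10, 1, 9, 0)  # hypothetical last training run
print(next_retrain_time(last_run))
print(can_retrain(last_run, datetime(2024, 10, 5, 9, 0)))  # False
print(can_retrain(last_run, datetime(2024, 10, 8, 9, 0)))  # True
```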
How Plain Concepts can help you
GitHub Copilot is having a major impact on the way developers and organizations create software. According to research from Accenture, developers using Copilot saw an 8% increase in pull requests, a 15% increase in pull request merge rate, and an 84% increase in successful builds.
The study also shows that 90% of developers were more satisfied with their work when using GitHub Copilot and 95% said they enjoyed coding more with this help.
Plain Concepts also ran a pilot test among our team to test its effectiveness, and you can see the initial results and conclusions here.
In addition, we believe that custom models represent the next big leap in coding, as they extend these capabilities directly to the inline code completion experience.
By training Copilot on your private code bases and also incorporating telemetry, custom models allow Copilot to adapt to your organization’s unique coding environment. Key steps have also been taken to build data security measures into fine-tuned models at the scale that enterprises need.
Each company’s data always remains private, and each company is its sole owner; it will never be used to train other customers’ models. When a training process is initiated, the data in your repositories and the telemetry data are tokenized and temporarily copied to the Azure training pipeline.
Some of this data is used for training, while another set is reserved for validation and quality assessment. Once the tuning process is complete, the model undergoes a series of quality assessments to ensure that it outperforms the reference model.
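The general idea of holding out validation data and gating deployment on a comparison with the reference model can be sketched as follows. This is an illustrative outline only, not GitHub’s actual pipeline: the split ratio, the scores, and all function names are assumptions.

```python
import random


def split_train_validation(samples, validation_fraction=0.1, seed=0):
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def passes_quality_gate(candidate_score, reference_score, min_gain=0.0):
    """Accept the fine-tuned model only if it beats the reference model."""
    return candidate_score > reference_score + min_gain


# Hypothetical tokenized training samples.
snippets = [f"snippet_{i}" for i in range(50)]
train, validation = split_train_validation(snippets, validation_fraction=0.2)
print(len(train), len(validation))  # 40 10

# Hypothetical evaluation scores for the fine-tuned and reference models.
print(passes_quality_gate(candidate_score=0.42, reference_score=0.35))  # True
```

Keeping the validation set disjoint from the training set is what makes the comparison with the reference model meaningful.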
If the model passes the quality checks, it is deployed to Azure OpenAI. This setup allows GitHub to host several LoRA models at scale while keeping them isolated from each other. Once the process is complete, the temporary training data is removed and the data flow resumes through the normal inference channels.
If you need a partner to unlock the full potential of GitHub Copilot, Plain Concepts makes it easy for you:
- We are the first partner in Spain accredited by GitHub.
- We have been working in Agile culture for more than 17 years and are a reference in the DevOps community.
- We have a specialized team of more than 350 senior engineers in App Innovation and DevOps.
- We are accredited as AMMP.
- We deliver DevSecOps with MVPs.
We also go beyond certifications: we offer an exclusive GitHub Adoption Framework to help you find the service that best suits your needs, guided by the best experts. Contact us to learn more!