Code Generation

Example Data

The data for fine-tuning a code generation model consists of pairs of natural language instructions (often in comments or docstrings) and their corresponding code implementations.

  • Input (Natural Language Prompt):
      # Write a Python function that takes a list of numbers 
      # and returns a new list with only the even numbers.
    
  • Target (Code Completion):
      def get_even_numbers(numbers):
          """
          Filters a list of numbers, returning only the even ones.
          """
          even_numbers = []
          for number in numbers:
              if number % 2 == 0:
                  even_numbers.append(number)
          return even_numbers
    
Use Case Scenario

The goal is to automatically generate correct, efficient, and syntactically valid code from a natural language description or a partial code snippet. This significantly speeds up the software development process.

  • AI Pair Programming (e.g., GitHub Copilot): A developer is working in their code editor (like VS Code). They type a comment: // Create a function to fetch user data from the API endpoint '/api/users'. The AI assistant instantly generates the complete function with the correct syntax for making an HTTP request.
  • Natural Language Data Analysis: A data scientist in a Jupyter Notebook types: "Plot the average house price by neighborhood from the 'san_jose_housing' dataframe." The model generates the necessary Python code using libraries like pandas and matplotlib to perform the calculation and create the visualization.
  • Automated Unit Testing: A developer writes a function, and the AI can automatically generate a suite of unit tests to verify its correctness.

How It Works: A Mini-Tutorial

The core insight behind code generation is that code is just a highly structured form of text. It has a strict grammar, syntax, and logical patterns. Therefore, LLMs, which are expert pattern recognizers, are exceptionally good at this task. The dominant architecture is the Decoder-Only model.

The Training Phase ✍️
  1. The Data: Code models are pre-trained on a massive corpus of text and code. The data comes from two main sources:
    • Public Code Repositories: Billions of lines of code from sources like GitHub are used. The model learns the syntax, structure, and common patterns of many programming languages (code -> code prediction).
    • Paired Data: To learn how to follow instructions, models are specifically trained on pairs of natural language and code. This data is mined from docstrings, code comments, programming tutorials, and Q&A sites like Stack Overflow (natural language -> code prediction).
  2. Tokenization: Code models often use a specialized tokenizer. Unlike a standard text tokenizer, a code tokenizer is optimized to handle programming constructs like indentation (which is critical in Python), brackets ({}, [], ()), operators (++, ->, :=), and common variable names. A short tokenizer sketch follows this list.

  3. Input Formatting: The training data is formatted into a single continuous sequence, just like other generative tasks. For an instruction-following pair, it would look like: "<instruction_comment>" <separator> "<code_implementation>"

  4. Architecture & Loss: The setup is identical to other text generation tasks.
    • The model is a standard decoder-only architecture (e.g., GPT, Llama, Codex).
    • It uses Causal Masking, meaning when predicting the next token, it can only see the code and comments that came before it.
    • The loss function is Cross-Entropy Loss, calculated on the model’s predictions for the next token against the actual next token in the training data. For instruction-following pairs, the loss might be calculated only on the code tokens (the completion), not the instruction tokens (the prompt). A sketch of this formatting and loss masking also follows this list.
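
To make step 2 concrete, here is a minimal tokenization sketch. It assumes the Hugging Face transformers library and borrows the general-purpose GPT-2 tokenizer purely for illustration; real code models train their own tokenizers on code.

    # Minimal sketch, assuming the Hugging Face `transformers` package.
    # The GPT-2 tokenizer is used only for illustration; code models typically
    # train their own tokenizers with extra tokens for whitespace runs, brackets, etc.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    snippet = "def is_even(n):\n    return n % 2 == 0\n"
    print(tokenizer.tokenize(snippet))
    # Inspect how the indentation, the colon, and the % operator are split.
    # Code-specific tokenizers usually add dedicated tokens for runs of
    # whitespace so indentation-heavy Python code tokenizes more compactly.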
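
Steps 3 and 4 can be sketched the same way. The instruction/completion pair and the newline separator below are illustrative assumptions rather than a real training format, but the masking convention (label positions set to -100 are ignored by cross-entropy) is the standard one in PyTorch / Hugging Face training code.

    # Minimal sketch of steps 3-4, assuming PyTorch and the `transformers` package.
    # The instruction/completion pair and the separator are illustrative only.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    instruction = "# Return only the even numbers from a list.\n"
    completion = (
        "def get_even_numbers(numbers):\n"
        "    return [n for n in numbers if n % 2 == 0]\n"
    )

    # Step 3: one continuous sequence -- "<instruction_comment>" <separator> "<code_implementation>".
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    completion_ids = tokenizer(completion, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)

    # Step 4: cross-entropy loss only on the completion. Positions labelled -100
    # are ignored, so the instruction tokens contribute nothing to the loss.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=input_ids, labels=labels).loss
    print(loss)
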
The Inference Phase (Writing Code) 👨‍💻
  1. The Prompt: The user provides a prompt. This can be a natural language comment, a function signature, or the beginning of a line of code.
    • Example Prompt: def send_email(recipient, subject, body):
  2. The Generation Loop: The model takes this prompt as its initial input and begins to generate the code autoregressively, one token at a time.

  3. The Autoregressive Process: This is the same loop used in all generative tasks (sketched in code after this list):
    • Predict: The model uses its final linear layer and a softmax function to get a probability distribution over all possible next tokens in its vocabulary.
    • Sample: A token is chosen from this distribution. For code generation, the sampling is often less random (using a lower “temperature”) than for creative writing, because correctness and predictability are more important than creativity.
    • Append: The newly chosen token is appended to the sequence, and this new, longer sequence becomes the input for the next step.
    • For the example prompt above, the loop might generate """Sends an email..., then the closing """, then import smtplib, and so on.
  4. Stopping Condition: The generation continues until the model determines the code block is logically complete (e.g., it has closed all brackets and returned from the function) or it generates a special end-of-sequence token.
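
The whole inference loop can be sketched in a few lines. This is an illustrative sketch assuming PyTorch and the general-purpose GPT-2 model from Hugging Face transformers; a real coding assistant would use a far larger code-trained model, but the predict, sample, append, stop structure is the same.

    # Minimal sketch of the inference loop, assuming PyTorch and `transformers`.
    # GPT-2 stands in for a real code model; the loop structure is what matters.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "def send_email(recipient, subject, body):\n"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    temperature = 0.2    # low temperature: favour likely, predictable tokens
    for _ in range(80):  # cap on the number of new tokens
        with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]           # Predict
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)     # Sample
        input_ids = torch.cat([input_ids, next_token], dim=1)    # Append
        if next_token.item() == tokenizer.eos_token_id:          # Stop on end-of-sequence
            break

    print(tokenizer.decode(input_ids[0]))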