Final Project

Due: Tuesday, April 28 at 11:59 PM ET

1. Weekly Meetings [20 points]

Each team must meet weekly with their assigned TA to track progress and set milestones. Scheduling is flexible and should be coordinated directly with your TA. There must be 3 meetings between the proposal deadline and the final project deadline, and both project members must be present at every meeting.

Come to each meeting prepared to discuss: what you completed that week, what you will complete before the next meeting, and any issues that are blocking progress. These meetings are meant to ensure steady progress and no last-minute issues. Teams should use this time to get feedback on design, evaluation, and demo readiness.

2. Your Agent [40 points]

Deliverables:

Code: A zip file containing your code, along with a README.md file that explains how to run your agent.
Design Document: A PDF writeup that clearly explains your agent implementation and design choices.

Required Agent Components:

Data incorporation: Your agent must incorporate external data relevant to the task. You should collect, parse, clean, and organize this data into a form your system can reliably use.
Memory and retrieval: Your agent must maintain a memory store and have a retrieval mechanism over that memory or knowledge base. The system should be able to retrieve relevant information when needed.
Three tools: Your agent must be able to call at least 3 meaningful tools. These tools should play a real functional role in solving the task, not exist only to satisfy the requirement.
Robust system design: Your system should have a clear agentic workflow with explicit logic for how it processes inputs, routes decisions, uses memory, calls tools, and terminates. The design should be modular and understandable.
Guardrails and exception handling: Your system must include guardrails and handle failure at all stages of execution. You should show how the agent checks user inputs, constrains or verifies tool calls, and validates outputs before returning them.
Evaluation framework: You must build an evaluation framework that tests whether your agent succeeds on intended tasks and how it behaves on edge cases, failures, and adversarial inputs.

Design Document Requirements:

Your design document should clearly explain your full system and justify your design choices. The document should begin by defining the problem, intended user, motivation, and project scope. You should then provide an end-to-end description of the system, explaining how the agent moves from user input to final output. It should include a detailed discussion of each required agent component: data incorporation, memory and retrieval, three tools, robust system design, guardrails, and evaluation. For each component, explain what you implemented, why you designed it that way, and any important alternatives, tradeoffs, or limitations.

The document must also include two diagrams. First, include a system flow graph showing the complete agent pipeline, including possible user inputs, major decision points, memory reads and writes, tool calls, guardrail checks, branches, and termination conditions. Second, include a threat model that identifies plausible risks such as unsafe user input, prompt injection, unsafe tool use, privacy leakage, or harmful outputs, and clearly indicates where validation and guardrails are applied in the system. These diagrams can be implemented in any format you choose, as long as all needed information is shown.

Finally, describe your evaluation methodology, including the kinds of tasks, success criteria, failure cases, and adversarial cases you test. You must also include three user transcripts that show meaningfully different system behaviors, such as a successful case, a difficult or ambiguous case, and a failure-handling or safety case.

3. Demo [40 points]

You are able to use late days only for the deliverables. Your demo must be done on April 28th, in class.

You will give a final presentation and live demonstration of your agent. The total presentation time must be 12 minutes. It should include a clear explanation of the problem, an overview of the full agent design, and a ~5 minute live demo showcasing your agent's abilities.

Your demo should provide evidence that the agent actually contains the required components. At minimum, you should show how user input is processed, how data is incorporated, how memory and retrieval are used, how the system selects or calls tools, how guardrails intervene when needed, and what the final output looks like. The live demo should make clear that the architecture is real and functioning as designed.

FAQ

How do late days work for the final project?

If you use late days on the final project, both project partners must each use a late day. One partner cannot use a late day on behalf of the team.

What if we run into rate limits?

You have one month to complete this project, so you should plan your development process accordingly. We expect teams to work around rate limits through good engineering practice: start early, test incrementally, cache intermediate outputs where appropriate, and avoid leaving all large-scale experimentation or evaluation to the end. In emergency cases, you may make additional Mistral or Hugging Face accounts to reset your credit usage, but do not rely on this.

What model should we use?

You may choose any free model that best fits your system design, but as a general guideline, use Kimi-K2.5 (download demo code here) for larger or more complex tasks and Mistral for smaller, lighter-weight tasks. Part of the project is making reasonable design decisions, so you should be able to justify your model choice based on the needs of your agent.