Architecture Deep Dive: AI Agent Platform for Hardware CI Workflows
This is the internal AI agent platform I built at Icomera for hardware test workflows. The useful part was not adding an LLM to testing. The useful part was building it on top of a real test platform, real CI pipelines, and real devices so it could act inside an existing engineering workflow instead of staying in chat.
Foundation first
The agent came after the test platform, not before it. I had already built the underlying Python and pytest framework that ran against embedded devices, handled device topology and setup through Ansible, and ran in CI across multiple product lines. That meant the later agent work had something real to operate on: test cases, pipelines, devices, logs, runners, and merge requests.
The actual problem
Writing and validating a test was never just “generate some code.” The slow part was the manual loop around it. You had to read the manual steps, inspect devices, compare behavior across firmware versions, trigger pipelines, open a branch, create a merge request, rerun on different devices, and keep the whole process grounded in what the hardware was actually doing. That was the workflow I wanted to shorten.
What the system does
The first version connected to TestRail to read a test case, then used SSH to inspect the device and walk through the steps in a live environment. From there it could prepare code changes, open merge requests, trigger pipelines, and help iterate on the result until the test behaved correctly on the right device and image. Over time it grew into a broader internal product used for debugging failures, writing tests, observing live runs, and handling repeatable workflow tasks inside the team.
Flagship subsystem: Autonomous CI Observer
The observer is the clearest example of why the architecture mattered. Instead of analyzing failures after the pipeline has already torn down the environment, the observer starts an observation session from a GitLab webhook. The path is: GitLab webhook → TypeScript orchestrator → SSH tail of structured JSONL events → event buffer and debounce → resumable Copilot session → live device investigation over SSH → structured update back to the engineering workflow.
It buffers those events and analyzes them while the run is still active. The system is designed for the messy reality of hardware testing: if the SSH connection drops because a device reboots mid-test, the observer does not blindly retry or crash. The analysis falls back to the event buffer captured before the connection dropped, which is often the only useful evidence left.
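A minimal sketch of that buffering idea, with hypothetical names and thresholds rather than the production code: JSONL events from the SSH tail land in a bounded buffer, analysis is debounced so a burst of events triggers one pass, and a dropped connection flushes whatever evidence was captured before the drop.

```typescript
// Sketch only: ObserverBuffer, TestEvent, and the limits are illustrative.
type TestEvent = { ts: string; kind: string; payload: unknown };

class ObserverBuffer {
  private events: TestEvent[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly analyze: (evts: TestEvent[]) => Promise<void>,
    private readonly maxEvents = 5000,  // bound memory per session
    private readonly debounceMs = 2000, // coalesce event bursts
  ) {}

  push(line: string): void {
    try {
      this.events.push(JSON.parse(line) as TestEvent);
    } catch {
      return; // tolerate partial JSONL lines from a live tail
    }
    if (this.events.length > this.maxEvents) this.events.shift();
    // Debounce: restart the timer on every event, analyze once the
    // stream goes quiet for debounceMs.
    if (this.timer) clearTimeout(this.timer);
    this.timer = setTimeout(() => void this.analyze([...this.events]), this.debounceMs);
  }

  // Called when the SSH connection drops (e.g. a device reboot mid-test).
  // Instead of retrying blindly, fall back to the evidence already buffered.
  onDisconnect(): Promise<void> {
    if (this.timer) clearTimeout(this.timer);
    return this.analyze([...this.events]);
  }
}
```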
Architecture and execution context
The human interface was chat, but the system behind it was a TypeScript orchestrator. One important design decision was to isolate LLM execution from the Node process. Instead of calling model APIs directly from the service, the orchestrator spawned Copilot CLI as a child process with a bounded environment, timeout handling, workspace control, and trace capture.
That made per-user attribution possible for GitLab actions. When an engineer asked the agent to trigger a pipeline or open a merge request, the orchestrator passed a user-scoped GitLab credential into the subprocess. Commits, merge requests, and pipeline triggers were attributed to the person who requested them rather than a shared service account.
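The exact CLI invocation is proprietary, but the shape of the isolation looks roughly like the sketch below, using Node's child_process: an allowlisted environment, a per-request working directory, a hard timeout, and the requesting user's GitLab token. The binary arguments, variable names, and recordTrace helper are placeholders.

```typescript
import { spawn } from "node:child_process";

// Illustrative sketch of the subprocess isolation, not the real invocation.
// The point: the LLM runs in a child process with a minimal environment, a
// controlled workspace, a hard timeout, and the *requesting user's* GitLab
// credential, so actions are attributed to a person, not a service account.
function runAgentSession(opts: {
  workspace: string;       // per-request worktree, see the concurrency section
  userGitlabToken: string; // scoped to the engineer who made the request
  timeoutMs: number;
}) {
  const child = spawn("copilot", [/* prompt/session args elided */], {
    cwd: opts.workspace,
    env: {
      // Allowlist instead of inheriting process.env wholesale.
      PATH: process.env.PATH ?? "",
      HOME: process.env.HOME ?? "",
      GITLAB_TOKEN: opts.userGitlabToken,
    },
    stdio: ["pipe", "pipe", "pipe"],
  });

  // Bounded execution: kill the subprocess if it overruns its budget.
  const killer = setTimeout(() => child.kill("SIGKILL"), opts.timeoutMs);
  child.on("exit", () => clearTimeout(killer));

  // Capture stdout/stderr for the trace store (see the tracing section).
  child.stdout?.on("data", (chunk) => recordTrace("stdout", String(chunk)));
  child.stderr?.on("data", (chunk) => recordTrace("stderr", String(chunk)));
  return child;
}

// Placeholder for the real trace sink.
function recordTrace(stream: string, text: string): void {
  console.log(`[trace:${stream}]`, text.trimEnd());
}
```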
Routing and tool access were controlled through markdown-defined skills. Adding a new skill did not require changing the TypeScript service. It meant writing a markdown document that described the capability, prerequisites, and expected commands, then letting the orchestrator select and inject that skill when the request matched.
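The skill format itself is internal; as a sketch, assume each skill is a markdown file carrying a small trigger header, and the orchestrator picks the best keyword match and injects the body into the prompt. The header convention and scoring here are illustrative.

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical skill format: a markdown file whose first line carries a
// trigger header, e.g.
//   <!-- triggers: pipeline, trigger, rerun -->
//   # Trigger a CI pipeline
//   ...prerequisites and expected commands...
type Skill = { name: string; triggers: string[]; body: string };

function loadSkills(dir: string): Skill[] {
  return readdirSync(dir)
    .filter((f) => f.endsWith(".md"))
    .map((f) => {
      const body = readFileSync(join(dir, f), "utf8");
      const m = body.match(/<!--\s*triggers:\s*(.*?)\s*-->/);
      const triggers = m ? m[1].split(",").map((t) => t.trim().toLowerCase()) : [];
      return { name: f, triggers, body };
    });
}

// Pick the skill whose trigger keywords overlap the request the most.
function selectSkill(skills: Skill[], request: string): Skill | undefined {
  const words = new Set(request.toLowerCase().split(/\W+/));
  let best: Skill | undefined;
  let bestScore = 0;
  for (const s of skills) {
    const score = s.triggers.filter((t) => words.has(t)).length;
    if (score > bestScore) [best, bestScore] = [s, score];
  }
  return best; // when defined, injected into the subprocess prompt
}
```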
Concurrency and Git worktrees
One of the hardest problems was concurrency. The agent handles multiple simultaneous requests, and if two LLM threads ran a git checkout in the same repository clone, they would stomp on each other and corrupt the workspace.
To solve this, the orchestrator enforces concurrency boundaries at the filesystem level. The main repository checkout stays read-only. When an agent thread needs to edit code, a dedicated bash script dynamically creates an isolated Git worktree for that specific branch. The agent works inside this isolated directory, separated from other active agent threads. A garbage collector cleans up stale worktrees after 7 days.
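The real system drives this from a bash script; the TypeScript sketch below shows the equivalent logic with illustrative paths and naming: one worktree per agent thread, sharing the object store of the read-only main clone, plus a mtime-based garbage collector.

```typescript
import { execFileSync } from "node:child_process";
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Sketch of the worktree isolation; REPO and WORKTREES are placeholders.
const REPO = "/srv/agent/repo";            // shared read-only checkout
const WORKTREES = "/srv/agent/worktrees";  // one directory per active thread

function createWorktree(threadId: string, branch: string): string {
  const dir = join(WORKTREES, threadId);
  // `git worktree add -b <branch> <dir>` gives the thread a private working
  // directory that shares the object store of the main clone.
  execFileSync("git", ["-C", REPO, "worktree", "add", "-b", branch, dir], {
    stdio: "inherit",
  });
  return dir; // the agent thread edits code only inside this directory
}

// Garbage collection: remove worktrees untouched for 7 days.
function gcWorktrees(maxAgeDays = 7): void {
  const cutoff = Date.now() - maxAgeDays * 24 * 60 * 60 * 1000;
  for (const entry of readdirSync(WORKTREES)) {
    const dir = join(WORKTREES, entry);
    if (statSync(dir).mtimeMs < cutoff) {
      execFileSync("git", ["-C", REPO, "worktree", "remove", "--force", dir]);
    }
  }
}
```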
Embedded retrieval and memory
Once the number of tasks grew, plain transcripts were not enough. I used hybrid retrieval (vector similarity plus keyword search), but instead of relying on an external SaaS vector database like Pinecone, I built a zero-infrastructure embedded store using better-sqlite3, the sqlite-vec C extension, and SQLite's FTS5 for keyword search.
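A minimal sketch of that store, assuming illustrative table names, embedding dimension, and fusion weights; the embedding model itself is out of scope here. Each note goes into both a vec0 table and an FTS5 table, and queries fuse the two rank lists with reciprocal rank fusion so a note only needs to score well on one side to surface.

```typescript
import Database from "better-sqlite3";
import * as sqliteVec from "sqlite-vec";

const db = new Database("memory.db");
sqliteVec.load(db); // loads the sqlite-vec C extension into this connection

db.exec(`
  CREATE VIRTUAL TABLE IF NOT EXISTS notes_vec USING vec0(embedding float[768]);
  CREATE VIRTUAL TABLE IF NOT EXISTS notes_fts USING fts5(content);
`);

function addNote(id: number, content: string, embedding: Float32Array): void {
  db.prepare("INSERT INTO notes_vec(rowid, embedding) VALUES (?, ?)")
    .run(BigInt(id), Buffer.from(embedding.buffer)); // raw float32 blob
  db.prepare("INSERT INTO notes_fts(rowid, content) VALUES (?, ?)")
    .run(BigInt(id), content);
}

// Hybrid retrieval: top-k from each index, fused by reciprocal rank.
function search(query: string, queryEmbedding: Float32Array, k = 10): number[] {
  const vecRows = db
    .prepare(
      `SELECT rowid FROM notes_vec
       WHERE embedding MATCH ? AND k = ?
       ORDER BY distance`,
    )
    .all(Buffer.from(queryEmbedding.buffer), k) as { rowid: number | bigint }[];
  const ftsRows = db
    .prepare(
      `SELECT rowid FROM notes_fts
       WHERE notes_fts MATCH ?
       ORDER BY rank LIMIT ?`,
    )
    .all(query, k) as { rowid: number | bigint }[];

  const scores = new Map<number, number>();
  for (const [rank, row] of vecRows.entries())
    scores.set(Number(row.rowid), (scores.get(Number(row.rowid)) ?? 0) + 1 / (60 + rank));
  for (const [rank, row] of ftsRows.entries())
    scores.set(Number(row.rowid), (scores.get(Number(row.rowid)) ?? 0) + 1 / (60 + rank));
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id)
    .slice(0, k);
}
```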
Context injection is selective rather than automatic. If an active session already has enough local context, the system does not keep stuffing more memory into the prompt. If it needs more, it can pull from recent conversation, indexed memory, and a maintained internal wiki.
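The gate itself can be simple; the sketch below uses a rough token-count heuristic as the stand-in for "enough local context", with illustrative names and thresholds.

```typescript
// Sketch of the injection gate: only reach into the memory store when the
// live session does not already carry enough context. Budget and the
// chars-per-token heuristic are illustrative.
const CONTEXT_BUDGET_TOKENS = 6000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, ~4 chars per token
}

function buildContext(sessionHistory: string[], request: string): string[] {
  const context = [...sessionHistory];
  const used = context.reduce((n, m) => n + estimateTokens(m), 0);
  if (used >= CONTEXT_BUDGET_TOKENS) return context; // enough local context
  // Otherwise widen the net: recent conversation first, then indexed
  // memory and the internal wiki (retrieval call elided).
  context.push(...retrieveFromMemory(request, CONTEXT_BUDGET_TOKENS - used));
  return context;
}

declare function retrieveFromMemory(query: string, tokenBudget: number): string[];
```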
Tracing and guardrails
I made tracing a first-class part of the design. Every user request, model invocation, tool call, routing decision, and team-facing response is captured as structured trace data. That matters because the only sustainable way to improve a system like this is to make it inspectable. I also added guardrails around message length, unsafe prompts, concurrency, and session handling so the system stays bounded and debuggable when other people use it.
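To make that concrete, here is the rough shape of a trace record and a pre-flight guardrail pass; the field names and limits are placeholders, not the production schema.

```typescript
// Illustrative trace schema: every step in a session emits one of these.
type TraceEvent = {
  ts: string;
  sessionId: string;
  userId: string;
  kind: "request" | "model_call" | "tool_call" | "routing" | "response";
  detail: Record<string, unknown>;
};

const MAX_MESSAGE_CHARS = 8000;   // illustrative limits
const MAX_SESSIONS_PER_USER = 3;

// Returns a rejection reason, or null if the request is accepted.
function guard(userId: string, message: string, active: Map<string, number>): string | null {
  if (message.length > MAX_MESSAGE_CHARS) return "message too long";
  if ((active.get(userId) ?? 0) >= MAX_SESSIONS_PER_USER) return "too many active sessions";
  if (/-----BEGIN [A-Z ]*PRIVATE KEY-----/.test(message)) return "secret material in prompt";
  return null; // accepted; every later step emits TraceEvents
}
```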
How it improved over time
The project changed once coworkers started using it. Real usage exposed where it was helpful, where it was brittle, and which requests kept repeating. Those repeated requests turned into features. The logging was not just for debugging. It also fed a nightly process that reviewed transcripts, traces, and session outputs, filtered out low-value noise, summarized the useful interactions, and folded the results back into the internal wiki and memory store.
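In outline, that nightly loop looked something like the sketch below; the function names and the filtering heuristic are hypothetical stand-ins for the real pipeline.

```typescript
// Sketch of the nightly distillation loop: pull the day's sessions, drop
// low-signal ones, summarize the rest, and fold the summaries back into
// the wiki and the memory store.
async function nightlyDistill(): Promise<void> {
  const sessions = await loadSessionsSince(Date.now() - 24 * 60 * 60 * 1000);
  for (const s of sessions) {
    if (s.turns < 2 || s.endedInError) continue; // filter low-value noise
    const summary = await summarize(s.transcript); // LLM summarization pass
    await appendToWiki(s.topic, summary);
    await indexIntoMemory(summary); // feeds the hybrid store above
  }
}

declare function loadSessionsSince(sinceMs: number): Promise<
  { turns: number; endedInError: boolean; transcript: string; topic: string }[]
>;
declare function summarize(transcript: string): Promise<string>;
declare function appendToWiki(topic: string, text: string): Promise<void>;
declare function indexIntoMemory(text: string): Promise<void>;
```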
What matters about it
The interesting part of this project is not that it used an LLM. The interesting part is that it was built on top of real systems with real constraints: CI pipelines, code changes, devices, sessions that run long enough to need resume, and workflows where a bad answer is worse than no answer. That is the difference between a chat demo and a system people actually rely on.
Some implementation details are intentionally kept high-level because the system was built inside a proprietary test and release environment.