GridSight Desktop Automation

I designed a vision-based cursor targeting system that replaces unreliable pixel-coordinate detection with a recursive alphanumeric grid architecture. The screen is divided into a region map; the model first selects the most probable region based on the user's intent. That region is then subdivided into a finer grid, and the model returns a single cell identifier rather than attempting to hallucinate absolute coordinates. If verification fails, the chosen cell is further subdivided into a subgrid for refinement. This hierarchical approach removes DPI scaling errors, mitigates hallucination, and produces deterministic cursor movement. The system favours vision over OCR wherever possible to stay robust across different layouts.

My long-term plan is to integrate YOLOv8 detection for fast, on-device UI element recognition and fuse it with the grid-based approach. This targeting engine is being built as a core component of my larger AI desktop control project, in which multiple reasoning, perception, and automation layers work together so an AI can reliably see, understand, and control a real computer.
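The core idea of the recursive grid can be sketched in a few lines. The function and cell-label scheme below are illustrative assumptions, not the project's actual API: each pick is an alphanumeric cell id (row letter, column number), and successive picks narrow the active region until the cell centre is small enough to click.

```python
# Hedged sketch of hierarchical grid targeting.
# Assumes cell ids like "B3" (row letter, column number); rows/cols counts
# are illustrative defaults, not the project's real grid dimensions.

def cell_bounds(region, rows, cols, cell_id):
    """Map a cell id like 'B3' inside `region` to its pixel bounds."""
    x, y, w, h = region
    row = ord(cell_id[0].upper()) - ord("A")   # 'A' -> row 0
    col = int(cell_id[1:]) - 1                 # '1' -> col 0
    cw, ch = w / cols, h / rows
    return (x + col * cw, y + row * ch, cw, ch)

def refine(region, picks, rows=4, cols=6):
    """Apply successive model picks (e.g. ['B3', 'A1']) to narrow a region,
    then return the centre of the final cell as the click point."""
    for cell_id in picks:
        region = cell_bounds(region, rows, cols, cell_id)
    x, y, w, h = region
    return (round(x + w / 2), round(y + h / 2))
```

For example, on a 1920x1080 screen, `refine((0, 0, 1920, 1080), ["B3", "A1"])` resolves two successive picks down to a single pixel coordinate. Because the model only ever emits a cell label, the mapping from label to pixels is deterministic arithmetic on the automation side, which is what removes the DPI-scaling and hallucination failure modes described above.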

I wanted a way for AI to actually understand what’s on my screen and move the mouse to the right place reliably, every time. The breakthrough was realising that instead of asking the AI, “Where is this button in pixels?”, I could divide the screen into a labelled grid and ask it a much simpler question: “Which box is it in?”.

Next steps