What is CVAT?
CVAT (Computer Vision Annotation Tool) is a free, open-source web application that lets people draw boxes, lines, masks, and other labels on images and videos so computers can learn to recognize what’s in them.
Let's break it down
- Computer Vision: teaching computers to see and understand pictures or video.
- Annotation: adding notes or marks (like boxes or outlines) that tell the computer what each part of the picture is.
- Tool: a software program you use to do the work.
- Open-source: anyone can look at the code, use it for free, and change it if they want.
- Web application: you use it through a web browser instead of installing a separate desktop program.
- Images and videos: the kinds of visual data you can label.
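To make "annotation" concrete, here is a minimal Python sketch of what a single box label might look like as data. The class name, field names, and file name are made up for illustration and are not CVAT's own storage format.

```python
from dataclasses import dataclass


@dataclass
class BoxAnnotation:
    """One hypothetical box label: what the object is and where it sits."""
    image: str     # file the label belongs to
    label: str     # class name chosen by the annotator
    x: float       # left edge of the box, in pixels
    y: float       # top edge of the box, in pixels
    width: float   # box width, in pixels
    height: float  # box height, in pixels

    def area(self) -> float:
        """Box area in square pixels."""
        return self.width * self.height


# An illustrative label: a pedestrian marked on one street photo.
box = BoxAnnotation(image="street_001.jpg", label="pedestrian",
                    x=120, y=80, width=40, height=110)
print(box.label, box.area())
```

A labeled dataset is, roughly, many records like this one, and a model is trained to predict them from the images.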
Why does it matter?
Good labels are the foundation of any computer-vision model; without accurate annotations, the model learns the wrong things. CVAT speeds up labeling and lets teams share the work, helping projects move from data collection to a working model more quickly.
Where is it used?
- Autonomous driving: labeling street-level photos and video to teach cars to spot pedestrians, traffic signs, and other vehicles.
- Medical imaging: outlining tumors or organs in scans so AI can assist doctors with diagnosis.
- Retail & e-commerce: marking products on shelf photos for inventory tracking and visual search.
- Robotics: annotating objects in a robot’s camera view so it can grasp or avoid them safely.
Good things about it
- Free and open-source, with no licensing fees.
- Runs in a browser, so multiple users can work together on the same project.
- Supports many annotation types (boxes, polygons, masks, points, tracks) and many file formats.
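- Built-in interpolation for video: you label an object on a few keyframes and the tool fills in the frames between them, which speeds up labeling of video sequences.
- Exports to common dataset formats (e.g., COCO, YOLO, TFRecord), so labels fit into popular machine-learning pipelines.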
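Two of the points above, video interpolation and export formats, can be sketched in a few lines of Python. This is an illustration of the general ideas, not CVAT's actual code: it assumes simple linear interpolation between two keyframes, and boxes stored as (x, y, width, height) in pixels with (x, y) the top-left corner, as in COCO-style boxes. All function names are hypothetical.

```python
def interpolate_box(start, end, t):
    """Linearly blend two (x, y, w, h) boxes; t=0 gives start, t=1 gives end."""
    return tuple(a + (b - a) * t for a, b in zip(start, end))


def boxes_between(frame_a, box_a, frame_b, box_b):
    """Return {frame: box} for every frame from frame_a to frame_b inclusive."""
    span = frame_b - frame_a
    if span <= 0:
        raise ValueError("frame_b must come after frame_a")
    return {
        f: interpolate_box(box_a, box_b, (f - frame_a) / span)
        for f in range(frame_a, frame_b + 1)
    }


def to_yolo(box, img_w, img_h):
    """Convert a top-left pixel box to YOLO-style (cx, cy, w, h), each in 0..1."""
    x, y, w, h = box
    return ((x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h)


# The annotator draws a box on frame 0 and again on frame 4;
# frames 1 to 3 are filled in automatically.
track = boxes_between(0, (100, 50, 40, 80), 4, (140, 50, 40, 80))
print(track[2])

# The same box written the way a YOLO label file expects, for a 640x480 frame.
print(to_yolo(track[2], 640, 480))
```

Real objects rarely move in a perfectly straight line, so in practice an annotator adds more keyframes wherever the interpolated box drifts away from the object.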
Not-so-good things
- Hosting it yourself means setting up a server (typically with Docker), which can be a hurdle for non-technical users.
- Can become slow or laggy when handling very large datasets or high-resolution video.
- The user interface, while functional, feels less polished than some commercial alternatives.
- Advanced features like semi-automatic labeling or AI-assisted suggestions can take extra setup and are more limited than in some paid tools.