Segmenting satellite images using SAM and Grounding DINO
Introduction to Echo and Shapes
Echo's mission is to enable companies to innovate faster by making the world easy to understand. We have a wide variety of clients, including grocery chains, insurance companies, ad agencies, and more.
Our product line can be roughly broken down into three principal areas: Data (Places and Shapes), Insights (mobility data and GeoPersona), and Technology (Location SDK). For this article, we will focus on one of our data products: Shapes.
In brief, the Shapes dataset consists of geospatial polygon data, representing features such as building footprints, parking lots, solar panels, etc. Our Shapes dataset is integral to our business. We sell it directly to clients and it is also a necessary input to many of our other datasets. For instance, to compute the dwell time inside a particular place we need to know its boundaries.
Freely licensed datasets
It is highly complex to build a satellite image analysis pipeline from scratch. So why don't we simply use some of the available, freely licensed datasets for building footprints?
There are a few notable freely licensed datasets for building footprints: Microsoft's Building Footprints, Google's Open Buildings, OpenStreetMap (OSM), and the Overture Maps Foundation (OMF).
For the most part, the Microsoft and Google datasets are either out of date or lack sufficient coverage in the US and Europe, where most of our clients are located.
OSM is an open-source mapping project maintained by a highly active network of volunteers around the world, similar in concept to Wikipedia. It tends to be very high quality, since volunteers often manually define and check the footprints. Although it has good coverage in both the US and Europe, it has very limited coverage outside of large metropolitan areas, especially in non-commercial areas such as suburbs or industrial neighborhoods.
OMF is a relatively new initiative that aims to combine multiple open-source and freely licensed geospatial datasets. Their building footprint data is mostly sourced from OSM and therefore has the same limitations.
There are several reasons why these datasets are not sufficient for our purposes:
- As mentioned, most of these datasets are either out of date or have insufficient coverage. OSM is the best one, but even that is patchy outside of major metropolitan areas.
- All of these datasets focus primarily on building footprints (or roads, in the case of OSM) and have low coverage for non-building features such as parking lots, trees, and solar panels. While OSM does include non-building features, their coverage outside urban areas is even less reliable than that of the building footprints and roads our customers might be interested in.
- For the initiatives that are effectively industrial consortiums, such as OMF, these datasets will only be updated as long as their members continue to benefit from them. Not being in control of a critical data source for our company is a bad idea.
- Most importantly, if we are able to make use of these datasets, then so can all of our competitors. To maintain a market advantage, we need to develop our own datasets and solutions. Of course, when it makes sense we can augment and enhance our own datasets with freely licensed data, but it should not be our only data source.
Having established the business need, let's get into the implementation.
SAM
The Segment Anything Model (SAM) is an open-source foundational model for image segmentation that was released by Meta in April 2023. It has excellent zero-shot performance on high-resolution images.
Although it was trained on high-resolution photographs, it has decent performance on satellite images. It works best at resolutions finer than one meter per pixel, and when used with point-based prompts (as opposed to bounding box prompts). Not surprisingly, it can struggle with features made of similar materials, such as the boundary between a sidewalk and a road, and with small irregular features, such as trees and shadows.
Grounded SAM
In order to know which features of an image to segment, we need to combine it with some kind of object detection model. After some preliminary testing, we saw promising results using Grounded SAM.
The approach is fairly simple. We pass the satellite image and a text prompt describing the feature we want to identify to Grounding DINO, which produces a bounding box in pixel space. We can then use that box as a prompt for SAM, or use the center of the box as a point prompt, since point prompts work better for satellite images.
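To make the hand-off concrete, here is a minimal sketch of the coordinate bookkeeping between the two models. It assumes Grounding DINO detections arrive in normalized center-format (cx, cy, w, h) boxes, which is the format its reference implementation emits; the function names are illustrative, not part of either library's API.

```python
# Sketch: convert a Grounding DINO detection into prompts for SAM.
# Assumes the detection is a normalized (cx, cy, w, h) box; SAM point
# prompts are (x, y) pixel coordinates with a foreground label.

def box_to_point_prompt(box_cxcywh, image_w, image_h):
    """Map a normalized center-format box to a pixel-space point prompt."""
    cx, cy, w, h = box_cxcywh
    # In our preliminary testing, the box center works better than the
    # full box as a SAM prompt for satellite features.
    return (round(cx * image_w), round(cy * image_h))

def box_to_xyxy_pixels(box_cxcywh, image_w, image_h):
    """Map the same box to pixel-space corners, if a box prompt is wanted."""
    cx, cy, w, h = box_cxcywh
    x0 = (cx - w / 2) * image_w
    y0 = (cy - h / 2) * image_h
    return (round(x0), round(y0), round(x0 + w * image_w), round(y0 + h * image_h))
```

For example, a detection centered at (0.5, 0.25) on a 1024x1024 tile becomes the point prompt (512, 256).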
By fine-tuning Grounding DINO with annotated data, we can further improve detection performance. However, even the zero-shot performance is surprisingly good, given that neither Grounding DINO nor SAM was originally trained on satellite image data.
Proposed architecture
Given the zero-shot performance of Grounded SAM, we are optimistic that fine-tuning the Grounding DINO encoder using manually annotated data will yield improvements. Thanks to using foundational models, our high-level architecture is conceptually fairly simple.
The most computationally intensive step is computing the image embeddings for each satellite image. Although the images we are working with are considered high resolution in terms of satellite imagery (~30 cm per pixel), they remain easy to manage from an image-processing standpoint. Consider that one hectare of satellite imagery at the resolutions we are working with is only about 350x350 pixels, and most of the features we are interested in analyzing are much smaller than a hectare.
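A quick sanity check on the sizes involved: one hectare is a 100 m x 100 m square, so at a ground sample distance of exactly 30 cm per pixel it covers only a few hundred pixels per side.

```python
import math

# Back-of-the-envelope tile sizing at a given ground sample distance (GSD).

def pixels_per_side(side_m: float, resolution_m_per_px: float) -> int:
    """Pixels needed to cover `side_m` meters at the given GSD."""
    return math.ceil(side_m / resolution_m_per_px)

# One hectare (100 m per side) at 30 cm per pixel: ~334 px per side,
# i.e. roughly a tenth of a megapixel per hectare.
hectare_px = pixels_per_side(100, 0.30)
```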
Image embeddings also only need to be computed once for each area and each type of embedding, and can then be reused for different runs of the object detection and segmentation models that target the same area. For some metropolitan areas, we may need to refresh the images every few months to account for new construction, but we can essentially treat embedding computation as a one-time cost.
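Because embeddings are keyed by area and embedding type, this "compute once, reuse everywhere" pattern can be sketched as a simple content-addressed cache. The key scheme and file layout below are illustrative only, not our production design.

```python
import hashlib
from pathlib import Path

# Sketch of a one-time embedding cache: embeddings are keyed by tile
# identifier and embedding model version, so later detection and
# segmentation runs over the same area reuse the stored result.

class EmbeddingCache:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, tile_id: str, model_version: str) -> Path:
        # Hash the key so arbitrary tile IDs map to safe file names.
        digest = hashlib.sha256(f"{tile_id}:{model_version}".encode()).hexdigest()
        return self.root / f"{digest}.emb"

    def get_or_compute(self, tile_id, model_version, compute):
        path = self._path(tile_id, model_version)
        if path.exists():
            return path.read_bytes()   # cache hit: reuse the embedding
        embedding = compute(tile_id)   # cache miss: run the image encoder
        path.write_bytes(embedding)
        return embedding
```

Refreshing imagery for fast-changing metropolitan areas then amounts to bumping the tile version in the key, which naturally invalidates the stale entries.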
The other costly step is annotation. We have partnered with Encord, a specialized image annotation service, to annotate the images we source from our satellite image provider. Like computing embeddings, annotation is a time-consuming and expensive process, but it is the only reliable way to generate ground-truth data. As we accumulate annotated images, they will also give us a competitive advantage, since we will be able to fine-tune our models faster than the competition.
We use Google Cloud Platform (GCP) as our cloud platform and chose Apache Beam as our distributed computing framework to accelerate processing, since GCP provides a fairly full-featured managed execution engine for Beam called Dataflow. Because our image processing pipeline incorporates custom Python libraries as well as large external dependencies, such as machine learning models, we need a flexible computing framework like Beam or Spark rather than a more structured data handling platform like BigQuery.
At a high level, we plan to use Beam for any distributed computing steps and then make the final dataset available as OGC Simple Features.
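The last step, publishing Simple Features, ultimately comes down to mapping segmentation output from pixel space into geographic coordinates. Here is a minimal sketch using a GDAL-style affine geotransform and GeoJSON as the serialization; the transform values and function names are made up for illustration and are not our production code.

```python
import json

# Sketch: turn a polygon in pixel coordinates into a GeoJSON Feature.
# `gt` is a GDAL-style geotransform: (origin_x, px_w, 0, origin_y, 0, -px_h).

def pixel_to_world(col, row, gt):
    """Affine map from (column, row) pixel indices to world coordinates."""
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return (x, y)

def polygon_feature(pixel_ring, gt, properties=None):
    """Build a GeoJSON Polygon Feature from a ring of pixel coordinates."""
    ring = [pixel_to_world(c, r, gt) for c, r in pixel_ring]
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # GeoJSON rings must be explicitly closed
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [list(map(list, ring))]},
        "properties": properties or {},
    }

# Hypothetical example: a 10x10-pixel footprint on a 30 cm/px tile whose
# top-left corner sits at easting 500000, northing 4649776.
gt = (500000.0, 0.3, 0.0, 4649776.0, 0.0, -0.3)
feature = polygon_feature([(0, 0), (10, 0), (10, 10), (0, 10)], gt,
                          {"class": "building"})
geojson = json.dumps(feature)
```

In practice the same structure can be written out through any Simple Features-compatible format or library; GeoJSON just keeps the sketch dependency-free.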
Our Shapes product delivers quality building footprint data, giving context to locations. It harnesses AI, machine learning, and computer vision to transform raw data into actionable insights. Developing our own datasets and solutions using SAM and Grounding DINO puts us in a unique position to circumvent some of the limitations that freely licensed datasets face.
Stay tuned for future articles detailing our progress as we implement this solution.