Building state-of-the-art geospatial data visualization products at Echo Analytics
Revolutionizing geospatial data visualization with Echo Analytics' GeoPersona.
This series of articles unveils the technical innovations behind Echo Analytics' geospatial data visualization products. In this first article, we explore the breakthroughs that powered the development of GeoPersona, the first product integrated into the Echo platform.
Introduction
GeoPersona is an audience segmentation tool that helps our users reach their target audience based on real-world visitation patterns. A key component of this process has been Kepler.GL, a powerful open-source visualization tool we used for mapping and visualizing GeoPersona data. Kepler.GL enabled clear and intuitive visualizations of complex geospatial data, as described in an article we published earlier this year. However, while it is capable and efficient at visualizing small datasets as a frontend-only solution, Kepler.GL is far less effective at managing large datasets, especially those above 500MB (roughly the size of a single GeoPersona segment in New York City).
Our objective is to enable seamless visualization of all GeoPersona segments across six countries. This includes instantaneous filtering, previews, and downloads, as demonstrated in the GIFs accompanying this article. In fact, the Echo Analytics platform already visualizes multiple TBs (denormalized) of GeoPersona data, while Kepler.GL can only handle around 500MB, meaning our solution handles thousands of times more data than Kepler.GL.
Today, we’ll demystify the core innovations that make it possible to navigate and interact with data at this scale. Specifically, we’ll dive into the three pivotal technologies behind this performance gap: vector tilesets, frontend data joins, and four-layer caching.
Three pivotal technologies
Three pivotal technologies ensure the performance of our platform, especially as data volumes scale up. These technologies are:
- Vector Tilesets: Efficiently manage and serve map data, only downloading what’s needed for the current view.
- Frontend Data Joins: Minimize data duplication and enable dynamic combination of data from multiple sources.
- Four-Layer Caching: Optimize data retrieval and reduce redundant computations at every layer of the stack.
In the following sections, we will dive into each one of these technologies.
Vector tilesets
Vector tilesets are like map pieces made up of tiny bits of data that can be used to build a map. As shown in Figure 1, one square represents one tile, and the total ensemble of all the tiles is one tileset. These tiles are "vectors" because they store information as points, lines, and shapes (instead of images).
The number of tiles at each zoom level is 2²ⁿ, where n is the zoom level (a grid of 2ⁿ × 2ⁿ tiles).
For maps with 23 zoom levels (the zoom range Mapbox provides), there are trillions of tiles per tileset.
Based on the user’s viewport, we can use latitude, longitude, zoom level, and screen size to determine which parts of the map (i.e., which tiles) to download.
f(latitude, longitude, zoom, width, height) => Corresponding tiles
For instance, if the user’s viewport is at zoom level 2, as represented by the red square in Figure 2, then the green tile in Figure 3 will be downloaded. If we consider one level of buffer for when the user scrolls across the screen, then the yellow tiles will be preloaded as well.
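To make this concrete, the mapping from a coordinate and zoom level to a tile index follows standard Web Mercator tile math. A minimal sketch (the function below is illustrative, not taken from our codebase):

```typescript
// Standard Web Mercator ("slippy map") tile math: which tile contains a
// given longitude/latitude at a given zoom level.
function lngLatToTile(lng: number, lat: number, zoom: number): { x: number; y: number; z: number } {
  const n = 2 ** zoom; // number of tiles along each axis at this zoom
  const x = Math.floor(((lng + 180) / 360) * n);
  const latRad = (lat * Math.PI) / 180;
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { x, y, z: zoom };
}

// Example: the tile containing central Manhattan at zoom level 2.
console.log(lngLatToTile(-73.97, 40.78, 2)); // { x: 1, y: 1, z: 2 }
```

Repeating this calculation over the four corners of the viewport (plus a buffer) yields the full set of tiles to request.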
Another advantage of vector tiles is their compact encoding using Google Protocol Buffers. Compared to JSON or GeoJSON, this format significantly reduces file size.
Implementation
We use Mapbox GL JS and its React wrapper, React Map GL, on the front end to request tiles.
On the back end, Martin is responsible for generating tiles from the PostGIS database.
Given a PostgreSQL connection string, Martin publishes every table that has at least one geometry column as a data source, with all non-geometry columns exposed as vector tile feature properties. The Martin server runs behind an NGINX proxy, which caches frequently accessed tiles and reduces unnecessary pressure on the database.
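For illustration, here is a hedged sketch of how the front end can request vector tiles from a Martin endpoint with React Map GL; the tile URL, source-layer name, map style, and token handling are placeholders, not our production configuration:

```tsx
// Sketch: render a vector tileset served by Martin with react-map-gl.
import * as React from 'react';
import Map, { Source, Layer } from 'react-map-gl';
import 'mapbox-gl/dist/mapbox-gl.css';

export function RegionMap() {
  return (
    <Map
      mapboxAccessToken={process.env.MAPBOX_TOKEN} // placeholder token source
      initialViewState={{ longitude: 2.35, latitude: 48.86, zoom: 5 }}
      mapStyle="mapbox://styles/mapbox/light-v11"
      style={{ width: '100%', height: 400 }}
    >
      {/* Martin publishes each PostGIS table as a tile source; Mapbox GL JS
          only requests the tiles covering the current viewport. */}
      <Source
        id="region-polygons"
        type="vector"
        tiles={['https://tiles.example.com/region_polygons/{z}/{x}/{y}']}
      >
        <Layer
          id="region-fill"
          type="fill"
          source-layer="region_polygons"
          paint={{ 'fill-color': '#4287f5', 'fill-opacity': 0.4 }}
        />
      </Source>
    </Map>
  );
}
```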
The Echo difference
Kepler.GL often processes data as large GeoJSON files in memory, whereas Echo Analytics employs vector tilesets to partition polygon data into manageable chunks. This partitioning ensures that only the data relevant to the user's current view is downloaded. Vector tilesets also benefit from compact encoding formats like Google Protocol Buffers, achieving a much smaller footprint compared to Kepler.GL’s reliance on GeoJSON.
Discussion
With all the benefits mentioned above, is it a good idea to put everything into vector tilesets?
It is a straightforward solution, and we built the first version by putting all the attributes in this data schema into tilesets. However, we quickly realized that without data normalization there is a lot of duplication. Moreover, it is cumbersome to extract data from the tileset when we want to use it outside of the map, or to preload data that is not in the viewport. To solve these problems, we turn to the next section: frontend data joins.
Frontend data joins
As previously mentioned, frontend data joins minimize data duplication and enable dynamic combination of data from multiple sources. Mapbox GL JS has a function called “setFeatureState”, which allows the front end to combine vector tile geometries with other data (such as JSON) dynamically. Being able to “join” on the front end means we can benefit from both the tileset and data normalization.
If we look closely into the GeoPersona data schema, the attributes can roughly be grouped into four categories: Geography Info (e.g., postal_code, region_name, country_code), Segments (e.g., geopersona_segment, brand_name), Segment Index (e.g., affinity_index_regional, affinity_index_national), and Demography (e.g., population, households, pop_age_0_to_5_years). Additional important attributes for data visualization that are not present in the data schema are the polygons for different admin boundaries (e.g., region polygons, zip code polygons).
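As a rough sketch, these categories map naturally onto record shapes like the following (field names come from the schema above; the exact types are assumptions):

```typescript
// Illustrative record shapes for the normalized GeoPersona categories.
interface GeographyInfo {
  postal_code: string;
  region_name: string;
  country_code: string;
}

interface Segment {
  geopersona_segment: string;
  brand_name: string;
}

interface SegmentIndex {
  affinity_index_regional: number;
  affinity_index_national: number;
}

interface Demography {
  population: number;
  households: number;
  pop_age_0_to_5_years: number;
}
```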
Table 1 summarizes the data size and data variation for a given country.
Implementation
We normalize the data into the five categories above (the four attribute groups plus polygons). Smaller datasets (Geography Info, Segments, Demography, Segment Index) are served through RESTful APIs, while the larger dataset (Polygons) remains in the tileset. Data is joined dynamically using setFeatureState with shared keys such as region codes or zip codes, depending on the visualization granularity.
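A minimal sketch of such a join, assuming the tile source promotes the zip code to the feature id (via promoteId) and that a hypothetical REST endpoint serves the segment index:

```typescript
// Sketch: join a small REST-served dataset onto vector tile geometries
// using Mapbox GL JS setFeatureState. Endpoint, source ids, and property
// names are illustrative.
import mapboxgl from 'mapbox-gl';

async function joinSegmentIndex(map: mapboxgl.Map, segmentId: string) {
  // 1. Fetch the small, normalized attribute dataset.
  const res = await fetch(`/api/segment-index?segment=${segmentId}`);
  const rows: { postal_code: string; affinity_index_regional: number }[] = await res.json();

  // 2. Attach each row to the matching polygon by its shared key.
  //    Assumes the source was added with promoteId: 'postal_code'.
  for (const row of rows) {
    map.setFeatureState(
      { source: 'zip-polygons', sourceLayer: 'zip_polygons', id: row.postal_code },
      { affinity: row.affinity_index_regional }
    );
  }
}

// Layers can then style on the joined value with a feature-state expression,
// e.g. 'fill-opacity': ['coalesce', ['feature-state', 'affinity'], 0].
```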
The Echo difference
Kepler.GL integrates all attributes into a single dataset before rendering. While this simplifies data preparation for small datasets, it leads to significant redundancy and ballooning file sizes for larger datasets.
In contrast, Echo Analytics leverages frontend data joins to normalize data, splitting it into smaller, reusable categories like geography, demographics, and segmentation indices. Only the relevant pieces are downloaded and joined dynamically at runtime. This approach not only reduces the overall data size but also enables incremental updates, allowing users to switch between views without reloading static data. Frontend data joins reduce the total GeoPersona dataset size from TBs (denormalized) to around 2GB (normalized). As more countries and segments are added to the product in the future, the difference will become even more significant.
Four-layer caching
Four-layer caching optimizes data retrieval and reduces redundant computations at every layer of the stack. A cache is a mechanism for improving data retrieval speed: it stores data so that future requests for that data can be served faster. There are four major cache layers in the GeoPersona architecture.
Implementation
We run the Martin tile server behind an NGINX proxy, which stores computed tiles the first time they are requested. On the back end, the HTTP Cache-Control header is set so that repeated requests within a given time window are answered immediately from the browser's disk cache. On the front end, we use TanStack Query for server state management; it caches data in RAM, so a repeated request for the same data does not even result in an HTTP request.
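As an illustration of the browser-side layer, a TanStack Query hook along these lines keeps responses in RAM, so identical requests within the stale window never reach the network (the query key, endpoint, and stale time are assumptions):

```typescript
// Sketch: TanStack Query caches the response in memory; repeated calls with
// the same key within staleTime are served from RAM without an HTTP request.
import { useQuery } from '@tanstack/react-query';

export function useSegmentIndex(countryCode: string, segmentId: string) {
  return useQuery({
    queryKey: ['segment-index', countryCode, segmentId],
    queryFn: () =>
      fetch(`/api/segment-index?country=${countryCode}&segment=${segmentId}`)
        .then((res) => res.json()),
    staleTime: 5 * 60 * 1000, // keep the data fresh in RAM for 5 minutes
  });
}
```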
The last layer is Zustand, which stores frontend-computed data in RAM so it can be reused across different parts of the UI. This helps avoid heavy, repetitive frontend computations such as data joins.
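A minimal sketch of such a store, with an illustrative shape for caching a computed join keyed by zip code:

```typescript
// Sketch: a Zustand store holding frontend-computed results so multiple UI
// components can reuse them without recomputing the join.
import { create } from 'zustand';

interface JoinCacheState {
  joinedByZip: Record<string, number>;
  setJoinedByZip: (joined: Record<string, number>) => void;
}

export const useJoinCache = create<JoinCacheState>((set) => ({
  joinedByZip: {},
  setJoinedByZip: (joined) => set({ joinedByZip: joined }),
}));
```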
The Echo difference
As a frontend-only solution, Kepler.GL processes data directly in the browser, so it only has caching for frontend computations. There’s no direct comparison to Echo Analytics, as Echo employs a four-layer caching strategy that supports pivotal technologies like vector tilesets and frontend data joins, optimizing both frontend and backend performance by minimizing redundant data retrieval and computation.
Other performance enhancements
Besides the three pivotal technologies, we also use web workers for frontend computation, debouncing to avoid excessive function calls, and early data loading for data that is very likely to be required by subsequent user actions. These techniques further boost application performance.
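For example, a small debounce helper along these lines collapses rapid calls (say, while the user pans the map or types in a filter) into a single trailing call:

```typescript
// Sketch: debounce a function so it runs once after calls stop for delayMs.
function debounce<T extends (...args: any[]) => void>(fn: T, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Parameters<T>) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Usage: recompute visible data at most once per 200 ms of idle time.
const onViewportChange = debounce((viewport: { zoom: number }) => {
  // ...trigger tile requests / filtering here
}, 200);
```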
Conclusion
Geospatial data visualization is both fascinating and challenging. This project provided an incredible opportunity to develop technical expertise while collaborating with brilliant minds at Echo. The success of GeoPersona is a true team effort. While this article highlights the frontend and backend development, none of it would have been possible without the invaluable contributions of our designer, product manager, data, infrastructure, and ML engineers, along with many others.
Stay tuned for more articles showcasing the exciting technical innovations at Echo Analytics!