Building state-of-the-art geospatial data visualization products at Echo Analytics
Revolutionizing geospatial data visualization with Echo Analytics' GeoPersona.
This series of articles unveils the technical innovations behind Echo Analytics' geospatial data visualization products. In this first article, we explore the breakthroughs that powered the development of GeoPersona, the first product integrated into the Echo platform.
Introduction
GeoPersona is an audience segmentation tool that helps our users reach their target audience based on real-world visitation patterns. A key component of this process has been Kepler.GL, a powerful open-source visualization tool we used for mapping and visualizing GeoPersona data. Kepler.GL enabled clear and intuitive visualizations of complex geospatial data, as described in an article we published earlier this year. However, while it is capable and efficient at visualizing small datasets as a frontend-only solution, Kepler.GL is far less effective at managing large datasets, especially those above 500MB (roughly the size of a single GeoPersona segment in New York City).
Our objective is to enable seamless visualization of all GeoPersona segments across six countries. This includes instantaneous filtering, previews, and downloads, as demonstrated in the GIFs accompanying this article. In fact, the Echo Analytics platform already visualizes multiple TBs (denormalized) of GeoPersona data, while Kepler.GL can only handle around 500MB, meaning our solution handles thousands of times more data than Kepler.GL.
Today, we’ll demystify the core innovations that make it possible to navigate and interact with data at this scale. Specifically, we’ll dive into the three pivotal technologies behind this performance gap: vector tilesets, frontend data joins, and four-layer caching.
Three pivotal technologies
Three pivotal technologies ensure the performance of our platform, especially as data volumes scale up. These technologies are:
- Vector Tilesets: Efficiently manage and serve map data, only downloading what’s needed for the current view.
- Frontend Data Joins: Minimize data duplication and enable dynamic combination of data from multiple sources.
- Four-Layer Caching: Optimize data retrieval and reduce redundant computations at every layer of the stack.
In the following sections, we will dive into each one of these technologies.
Vector tilesets
Vector tilesets are like map pieces made up of tiny bits of data that can be used to build a map. As shown in Figure 1, one square represents one tile, and the total ensemble of all the tiles is one tileset. These tiles are "vectors" because they store information as points, lines, and shapes (instead of images).
The number of tiles at each zoom level is 2²ⁿ, where n is the zoom level (a grid of 2ⁿ × 2ⁿ tiles).
For maps with 23 zoom levels (the zoom range Mapbox provides), there are trillions of tiles per tileset.
Based on the user’s viewport, we can use latitude, longitude, zoom level, and screen size to determine which parts of the map (i.e., which tiles) to download.
f(latitude, longitude, zoom, width, height) => Corresponding tiles
For instance, if the user’s viewport is at zoom level 2, as represented by the red square in Figure 2, then the green tile in Figure 3 will be downloaded. If we consider one level of buffer for when the user scrolls across the screen, then the yellow tiles will be preloaded as well.
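To make this concrete, the mapping from a coordinate and zoom level to a tile index follows standard Web Mercator tile math. A minimal sketch (the function below is illustrative, not taken from our codebase):

```typescript
// Standard Web Mercator ("slippy map") tile math: which tile contains a
// given longitude/latitude at a given zoom level.
function lngLatToTile(lng: number, lat: number, zoom: number): { x: number; y: number; z: number } {
  const n = 2 ** zoom; // number of tiles along each axis at this zoom
  const x = Math.floor(((lng + 180) / 360) * n);
  const latRad = (lat * Math.PI) / 180;
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { x, y, z: zoom };
}

// Example: the tile containing central Manhattan at zoom level 2.
console.log(lngLatToTile(-73.97, 40.78, 2)); // { x: 1, y: 1, z: 2 }
```

Repeating this calculation over the four corners of the viewport (plus a buffer) yields the full set of tiles to request.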
Another advantage of vector tiles is their compact encoding using Google Protocol Buffers. Compared to JSON or GeoJSON, this format significantly reduces file size.
Implementation
We use Mapbox GL JS and its React wrapper, React Map GL, on the front end to request tiles.
On the back end, Martin is responsible for generating tiles from the PostGIS database.
Given a PostgreSQL connection string, Martin publishes every table that has at least one geometry column as a data source, with all non-geometry columns exposed as vector tile feature properties. The Martin server runs behind an NGINX proxy, which caches frequently accessed tiles and reduces unnecessary pressure on the database.
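For illustration, here is a hedged sketch of how the front end can request vector tiles from a Martin endpoint with React Map GL; the tile URL, source-layer name, map style, and token handling are placeholders, not our production configuration:

```tsx
// Sketch: render a vector tileset served by Martin with react-map-gl.
import * as React from 'react';
import Map, { Source, Layer } from 'react-map-gl';
import 'mapbox-gl/dist/mapbox-gl.css';

export function RegionMap() {
  return (
    <Map
      mapboxAccessToken={process.env.MAPBOX_TOKEN} // placeholder token source
      initialViewState={{ longitude: 2.35, latitude: 48.86, zoom: 5 }}
      mapStyle="mapbox://styles/mapbox/light-v11"
      style={{ width: '100%', height: 400 }}
    >
      {/* Martin publishes each PostGIS table as a tile source; Mapbox GL JS
          only requests the tiles covering the current viewport. */}
      <Source
        id="region-polygons"
        type="vector"
        tiles={['https://tiles.example.com/region_polygons/{z}/{x}/{y}']}
      >
        <Layer
          id="region-fill"
          type="fill"
          source-layer="region_polygons"
          paint={{ 'fill-color': '#4287f5', 'fill-opacity': 0.4 }}
        />
      </Source>
    </Map>
  );
}
```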
The Echo difference
Kepler.GL often processes data as large GeoJSON files in memory, whereas Echo Analytics employs vector tilesets to partition polygon data into manageable chunks. This partitioning ensures that only the data relevant to the user's current view is downloaded. Vector tilesets also benefit from compact encoding formats like Google Protocol Buffers, achieving a much smaller footprint compared to Kepler.GL’s reliance on GeoJSON.
Discussion
With all the benefits mentioned above, is it a good idea to put everything into vector tilesets?
It is a straightforward solution, and we built the first version by putting all the attributes in this data schema into tilesets. However, we quickly realized that without data normalization there is a lot of duplication. Moreover, it is cumbersome to extract data from the tileset when we want to use it outside of the map, or to preload data that is not in the viewport. To solve these problems, we turn to the next section: frontend data joins.
Frontend data joins
As previously mentioned, frontend data joins minimize data duplication and enable dynamic combination of data from multiple sources. Mapbox GL JS has a function called “setFeatureState”, which allows the front end to combine vector tile geometries with other data (such as JSON) dynamically. Being able to “join” on the front end means we can benefit from both the tileset and data normalization.
If we look closely into the GeoPersona data schema, the attributes can roughly be grouped into four categories: Geography Info (e.g., postal_code, region_name, country_code), Segments (e.g., geopersona_segment, brand_name), Segment Index (e.g., affinity_index_regional, affinity_index_national), and Demography (e.g., population, households, pop_age_0_to_5_years). Additional important attributes for data visualization that are not present in the data schema are the polygons for different admin boundaries (e.g., region polygons, zip code polygons).
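As a rough sketch, these categories map naturally onto record shapes like the following (field names come from the schema above; the exact types are assumptions):

```typescript
// Illustrative record shapes for the normalized GeoPersona categories.
interface GeographyInfo {
  postal_code: string;
  region_name: string;
  country_code: string;
}

interface Segment {
  geopersona_segment: string;
  brand_name: string;
}

interface SegmentIndex {
  affinity_index_regional: number;
  affinity_index_national: number;
}

interface Demography {
  population: number;
  households: number;
  pop_age_0_to_5_years: number;
}
```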
Table 1 summarizes the data size and data variation for a given country.
Implementation
We normalize the data into the five categories above (the four attribute groups plus polygons). Smaller datasets (Geography Info, Segments, Demography, Segment Index) are served through RESTful APIs, while the larger dataset (Polygons) remains in the tileset. Data is joined dynamically using setFeatureState with shared keys such as region codes or zip codes, depending on the visualization granularity.
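A minimal sketch of such a join, assuming the tile source promotes the zip code to the feature id (via promoteId) and that a hypothetical REST endpoint serves the segment index:

```typescript
// Sketch: join a small REST-served dataset onto vector tile geometries
// using Mapbox GL JS setFeatureState. Endpoint, source ids, and property
// names are illustrative.
import mapboxgl from 'mapbox-gl';

async function joinSegmentIndex(map: mapboxgl.Map, segmentId: string) {
  // 1. Fetch the small, normalized attribute dataset.
  const res = await fetch(`/api/segment-index?segment=${segmentId}`);
  const rows: { postal_code: string; affinity_index_regional: number }[] = await res.json();

  // 2. Attach each row to the matching polygon by its shared key.
  //    Assumes the source was added with promoteId: 'postal_code'.
  for (const row of rows) {
    map.setFeatureState(
      { source: 'zip-polygons', sourceLayer: 'zip_polygons', id: row.postal_code },
      { affinity: row.affinity_index_regional }
    );
  }
}

// Layers can then style on the joined value with a feature-state expression,
// e.g. 'fill-opacity': ['coalesce', ['feature-state', 'affinity'], 0].
```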
The Echo difference
Kepler.GL integrates all attributes into a single dataset before rendering. While this simplifies data preparation for small datasets, it leads to significant redundancy and ballooning file sizes for larger datasets.
In contrast, Echo Analytics leverages frontend data joins to normalize data, splitting it into smaller, reusable categories like geography, demographics, and segmentation indices. Only the relevant pieces are downloaded and joined dynamically at runtime. This approach not only reduces the overall data size but also enables incremental updates, allowing users to switch between views without reloading static data. Frontend data joins reduce the total GeoPersona dataset size from TBs (denormalized) to around 2GB (normalized). As more countries and segments are added to the product in the future, the difference will become even more significant.
Four-layer caching
Four-layer caching optimizes data retrieval and reduces redundant computations at every layer of the stack. A cache is a mechanism for improving data retrieval speed: it stores data so that future requests for that data can be served faster. There are four major cache layers in the GeoPersona architecture.
Implementation
We run the Martin tile server behind an NGINX proxy, which stores computed tiles the first time they are requested. On the back end, the HTTP Cache-Control header is set so that repeated requests within a given time window are answered immediately from the browser's disk cache. On the front end, we use TanStack Query for server state management; it caches data in RAM, so a repeated request for the same data does not even result in an HTTP request.
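As an illustration of the browser-side layer, a TanStack Query hook along these lines keeps responses in RAM, so identical requests within the stale window never reach the network (the query key, endpoint, and stale time are assumptions):

```typescript
// Sketch: TanStack Query caches the response in memory; repeated calls with
// the same key within staleTime are served from RAM without an HTTP request.
import { useQuery } from '@tanstack/react-query';

export function useSegmentIndex(countryCode: string, segmentId: string) {
  return useQuery({
    queryKey: ['segment-index', countryCode, segmentId],
    queryFn: () =>
      fetch(`/api/segment-index?country=${countryCode}&segment=${segmentId}`)
        .then((res) => res.json()),
    staleTime: 5 * 60 * 1000, // keep the data fresh in RAM for 5 minutes
  });
}
```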
The last layer is Zustand, which stores frontend-computed data in RAM so it can be reused across different parts of the UI. This helps avoid heavy, repetitive frontend computations such as data joins.
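A minimal sketch of such a store, with an illustrative shape for caching a computed join keyed by zip code:

```typescript
// Sketch: a Zustand store holding frontend-computed results so multiple UI
// components can reuse them without recomputing the join.
import { create } from 'zustand';

interface JoinCacheState {
  joinedByZip: Record<string, number>;
  setJoinedByZip: (joined: Record<string, number>) => void;
}

export const useJoinCache = create<JoinCacheState>((set) => ({
  joinedByZip: {},
  setJoinedByZip: (joined) => set({ joinedByZip: joined }),
}));
```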
The Echo difference
As a frontend-only solution, Kepler.GL processes data directly in the browser, so it only has caching for frontend computations. There’s no direct comparison to Echo Analytics, as Echo employs a four-layer caching strategy that supports pivotal technologies like vector tilesets and frontend data joins, optimizing both frontend and backend performance by minimizing redundant data retrieval and computation.
Other performance enhancements
Besides the three pivotal technologies, we also use web workers for frontend computation, debouncing to avoid excessive function calls, and early data loading for data that is very likely to be required by subsequent user actions. These techniques further boost application performance.
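For example, a small debounce helper along these lines collapses rapid calls (say, while the user pans the map or types in a filter) into a single trailing call:

```typescript
// Sketch: debounce a function so it runs once after calls stop for delayMs.
function debounce<T extends (...args: any[]) => void>(fn: T, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Parameters<T>) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Usage: recompute visible data at most once per 200 ms of idle time.
const onViewportChange = debounce((viewport: { zoom: number }) => {
  // ...trigger tile requests / filtering here
}, 200);
```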
Conclusion
Geospatial data visualization is both fascinating and challenging. This project provided an incredible opportunity to develop technical expertise while collaborating with brilliant minds at Echo. The success of GeoPersona is a true team effort. While this article highlights the frontend and backend development, none of it would have been possible without the invaluable contributions of our designer, product manager, data, infrastructure, and ML engineers, along with many others.
Stay tuned for more articles showcasing the exciting technical innovations at Echo Analytics!