Geonix

Methodology

How the performance claims on this site are produced.

Overview

The performance figures cited on this site come from running the platform against synthetic vehicle trajectory data, not against any specific employer's dataset. The synthetic data is generated with SUMO (Simulation of Urban MObility), an open-source traffic microsimulator developed by the German Aerospace Center (DLR) and licensed under EPL-2.0.

Synthetic data generation

Cities: Tokyo and Osaka. Source road networks: OpenStreetMap extracts (ODbL). Vehicle count: approximately 1 million vehicles per city per simulated day. Simulated duration: 1 day (24 hours). Output: one synthetic GPS ping per vehicle every 5 seconds, written to per-day Parquet partitions. Total uncompressed size on disk: TODO_USER_COPY: size.
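As a rough sizing aid, the ping volume implied by these parameters can be estimated from the vehicle count, the 5-second sampling interval, and how long each vehicle stays on the network. The sketch below is illustrative Python, not part of the pipeline. The function name and the 30-minute mean trip duration are assumptions chosen for the example, since the actual trip-length distribution is not stated here.

```python
PING_INTERVAL_S = 5            # from the text: one ping every 5 seconds
VEHICLES_PER_CITY = 1_000_000  # from the text: approximate count per city
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_pings(vehicles: int, active_seconds: float,
                    interval_s: int = PING_INTERVAL_S) -> int:
    """Estimate ping count if each vehicle is on the network for active_seconds."""
    return int(vehicles * (active_seconds // interval_s))


# Hard upper bound: every vehicle active for the full simulated day.
upper_bound = estimated_pings(VEHICLES_PER_CITY, SECONDS_PER_DAY)

# Hypothetical mean trip of 30 minutes (an assumption, not a measured value).
assumed_trip = estimated_pings(VEHICLES_PER_CITY, 30 * 60)

print(upper_bound, assumed_trip)
```

Under these assumptions the upper bound is 17.28 billion pings per city, and the 30-minute case gives 360 million. The real figure depends on the simulated trip lengths.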

Pipeline architecture

Raw SUMO output → schema validation → Polars LazyFrame ingestion → semi-join pre-filter → HMM + Viterbi map matching (Rust, Rayon parallel) → enrichment and aggregation → Hive-partitioned GeoParquet output. The matcher uses a GraphHopper-exported road network for Tokyo and Osaka (FlatBuffers binary, loaded once at startup) and an R-tree spatial index over individual edge segments for sub-millisecond candidate search.
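The HMM + Viterbi stage is the core of this pipeline. The production matcher is written in Rust; what follows is only a minimal Python sketch of the general technique, assuming a Gaussian emission score on ping-to-edge distance and a caller-supplied transition score. The names `emission_logp`, `viterbi_match`, `sigma_m`, and the toy adjacency set are hypothetical and do not come from the Geonix codebase.

```python
import math


def emission_logp(distance_m: float, sigma_m: float = 10.0) -> float:
    """Gaussian log-score for a ping lying distance_m from a candidate edge
    (normalisation constant dropped; sigma_m is an assumed GPS noise scale)."""
    return -0.5 * (distance_m / sigma_m) ** 2


def viterbi_match(candidates, transition_logp):
    """Return the most likely edge sequence.

    candidates: one list per ping of (edge_id, distance_m) pairs; every
                ping is assumed to have at least one candidate.
    transition_logp(prev_edge, edge): log-score of moving between edges.
    """
    if not candidates:
        return []
    scores = [emission_logp(d) for _, d in candidates[0]]
    back = []  # back[t-1][j] = index of best predecessor of candidate j at step t
    for t in range(1, len(candidates)):
        prev = candidates[t - 1]
        new_scores, pointers = [], []
        for edge, dist in candidates[t]:
            best_i, best_s = max(
                ((i, scores[i] + transition_logp(prev[i][0], edge))
                 for i in range(len(prev))),
                key=lambda pair: pair[1],
            )
            new_scores.append(best_s + emission_logp(dist))
            pointers.append(best_i)
        scores = new_scores
        back.append(pointers)
    # Backtrack from the best final state.
    idx = max(range(len(scores)), key=scores.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[t][i][0] for t, i in enumerate(path)]


# Toy network: A -> B -> C, plus staying on the same edge.
ADJACENT = {("A", "A"), ("A", "B"), ("B", "B"), ("B", "C"), ("C", "C")}


def toy_transition(prev_edge, edge):
    return 0.0 if (prev_edge, edge) in ADJACENT else -math.inf


pings = [
    [("A", 2.0), ("C", 9.0)],
    [("C", 4.0), ("B", 5.0)],   # nearest edge is C, but A -> C is not connected
    [("B", 3.0), ("C", 30.0)],
]
matched = viterbi_match(pings, toy_transition)
print(matched)
```

In this toy input the nearest edge per ping would be A, C, B, which is not a drivable sequence on the toy network. The decoder instead returns A, B, B, showing how the transition term overrides a locally closer but disconnected candidate.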

Benchmark setup

Hardware: TODO_USER_COPY: CPU, RAM, storage type. OS: TODO_USER_COPY: OS. Toolchain: rustc TODO_USER_COPY: version (release profile, codegen-units tuned). Pipeline configuration: production defaults (Rayon all-cores, Polars streaming on, R-tree maxNodeFill tuned). What is measured: wall-clock time from matcher process start to GeoParquet output flush, including graph load, R-tree build, all I/O, and matching.
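One way to take an end-to-end wall-clock measurement of this kind is to time the matcher from outside the process, so that startup costs such as graph loading and index construction are included. The following is a minimal sketch, assuming the matcher is an executable invoked from the command line. `time_command` and the `./matcher` invocation mentioned in the comment are hypothetical, and timing to process exit is only an approximation of timing to output flush.

```python
import subprocess
import sys
import time


def time_command(argv: list[str]) -> float:
    """Run argv to completion and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit
    return time.perf_counter() - start


# Stand-in command so the sketch runs anywhere; a real run would pass
# something like ["./matcher", "--config", "bench.toml"] (hypothetical).
elapsed_s = time_command([sys.executable, "-c", "pass"])
print(f"wall-clock: {elapsed_s:.3f} s")
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and intended for measuring intervals.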

Results

TODO_USER_COPY: results paragraph — total vehicles ingested, total matched-link rows produced, wall-clock minutes, equivalent throughput in records/sec. Comparison to a Python baseline run on the same hardware, if available.
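For reference, converting a wall-clock time into records per second, and comparing against a baseline, is simple arithmetic. The sketch below uses hypothetical function names, and the numbers in the example are placeholders chosen for round arithmetic. They are not benchmark results.

```python
def records_per_second(records: int, wall_clock_minutes: float) -> float:
    """Throughput implied by processing `records` in `wall_clock_minutes`."""
    if wall_clock_minutes <= 0:
        raise ValueError("wall_clock_minutes must be positive")
    return records / (wall_clock_minutes * 60.0)


def speedup(baseline_minutes: float, candidate_minutes: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    if baseline_minutes <= 0 or candidate_minutes <= 0:
        raise ValueError("durations must be positive")
    return baseline_minutes / candidate_minutes


# Placeholder inputs only (not measured values).
example_throughput = records_per_second(1_200_000, 2.0)
example_speedup = speedup(30.0, 2.0)
print(example_throughput, example_speedup)
```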

Caveats

Synthetic SUMO trajectories follow simulated driver behavior with idealized GPS sampling. Real-world probe data carries additional noise (GPS drift, dropouts, partial trips) that the production pipeline handles in preprocessing stages that SUMO output does not exercise. The numbers here therefore represent the matcher's peak performance on clean inputs; production throughput on noisy real-world data is typically 50–80% of these figures. All benchmark numbers cited elsewhere on this site reflect runs on the synthetic dataset described above; the site does not publish figures from any specific employer engagement.