Aggregating Data
Dynamic configuration and annotations deliver real-time data cleaning and transformation. However, they also introduce challenges for large-scale aggregations (e.g., cumulative sum), because queries must evaluate the chain of configs and annotations for every datapoint.
Aggregating Data in Release 2
Release 2 focused on automated ingestion, long-term storage, and data exploration and management. For aggregating data over arbitrary time windows, the client has to retrieve and process every underlying datapoint so that annotations can be applied correctly.
For a few specific cases, such as cumulative rainfall, Release 2 added a custom caching mechanism: it periodically queries its own API, stores post-annotation results, and reuses those values for fast aggregations. That works for rainfall, but it adds complexity and storage overhead, is hard to generalize, and does not scale to other aggregation types.
Aggregating Data in Release 3
Prior to the kick-off of Release 3, a real-time data aggregation proof-of-concept was developed to test whether the system could efficiently combine data across multiple backing stores while correctly applying all annotations in a single pass. The PoC proved successful and drove three important design shifts that now shape Release 3 development.
- Mindset shift: The dynamic querying layer is more than a conventional API. It is a Datastream Query Engine that selects the right datapoints configs by time interval, performs annotation actions, queries the appropriate backing stores, and returns one coherent time series to the client.
- Design shift: Aggregations must be first-class operations within the system rather than post-processing steps in application code. The design should use a streaming pipeline that can push filters and pre-aggregations down to the source (where supported) so less raw data crosses the wire, while applying the full configuration and annotation model along the way.
- Language shift: A query engine must do intensive numeric work and coordinate many concurrent reads on every request. Release 2’s dendra-web-api handled its API workload well, offloading the occasional heavy job to worker threads, but making that the constant workload pushes the Node.js stack against two costs:
  - Numeric math: Release 2 evaluates annotation expressions and conversions through math.js BigNumber. Interpreted, arbitrary-precision arithmetic is far slower and more memory-hungry per datapoint than compiled native math.
  - Concurrency: Node reaches the required throughput only by hand-managing worker pools and connection pooling, and by copying data between workers to emulate the shared-memory parallelism the runtime lacks.

  To optimize the query engine, the Dendra team made the shift to Go, where native numeric types and goroutines address both of these challenges more directly. Additionally, conversions and custom annotation calculations can run as compiled functions (WebAssembly), replacing the interpreted expressions.