openGemini's key computing performance technologies

Post by Rina7RS »

Recently, a syntax converter was added to translate PromQL into InfluxQL syntax, so that modules such as the parser, optimizer, and execution engine can work on the InfluxQL syntax tree. These modules have been modified to a certain extent to adapt to Prometheus, for example by adding some PromQL-specific operators to the physical execution operators. openGemini currently has two storage engines: a time-series engine and a columnar engine. Prometheus data can only use the time-series engine for now; the community will consider adapting the columnar engine to support high-cardinality Prometheus metric data in the future.
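
To make the idea concrete, here is a minimal, purely illustrative Go sketch of what such a converter does: it turns a (pre-parsed) PromQL instant-vector selector into an InfluxQL-style SELECT statement. The types and the exact mapping are assumptions made for this example; openGemini's real converter rewrites full syntax trees and covers far more of PromQL.

```go
package main

import (
	"fmt"
	"strings"
)

// LabelMatcher is a hypothetical, simplified PromQL label matcher (only "=" shown).
type LabelMatcher struct {
	Name, Value string
}

// VectorSelector is a hypothetical, simplified PromQL instant-vector selector,
// e.g. http_requests_total{job="api", method="GET"}.
type VectorSelector struct {
	Metric   string
	Matchers []LabelMatcher
}

// toInfluxQL renders the selector as an InfluxQL-style SELECT statement.
// This only illustrates the direction of the translation.
func toInfluxQL(sel VectorSelector) string {
	conds := make([]string, 0, len(sel.Matchers))
	for _, m := range sel.Matchers {
		conds = append(conds, fmt.Sprintf(`"%s" = '%s'`, m.Name, m.Value))
	}
	q := fmt.Sprintf(`SELECT value FROM "%s"`, sel.Metric)
	if len(conds) > 0 {
		q += " WHERE " + strings.Join(conds, " AND ")
	}
	return q
}

func main() {
	sel := VectorSelector{
		Metric:   "http_requests_total",
		Matchers: []LabelMatcher{{"job", "api"}, {"method", "GET"}},
	}
	fmt.Println(toInfluxQL(sel))
	// SELECT value FROM "http_requests_total" WHERE "job" = 'api' AND "method" = 'GET'
}
```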

The figure above shows how openGemini partitions data. Considering the time-series nature of metric data and the need for even data distribution, data is partitioned along two dimensions: a range partition based on time and a hash partition based on the partition key. When no partition key is specified, the metric table and label timeline are used as the partition key by default. All shards that fall within the same time range form a shard group; two shard groups are generated here. All shards obtained along the partition-key dimension form a partition; six partitions are generated here. Assuming the cluster has three nodes and each node holds two partitions, each node then holds four shards (two partitions across two shard groups). These concepts come up again when reading and writing data.
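
As a rough illustration of the two-dimensional partitioning, the following Go sketch places a data point by computing a shard-group index from its timestamp and a partition index from a hash of its series key. The shard-group duration, hash choice, and partition-to-node mapping are assumptions made for the example, not openGemini's actual placement logic.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

const (
	shardGroupDuration = 24 * time.Hour // assumed time-range width of one shard group
	numPartitions      = 6              // partitions along the partition-key dimension
	numNodes           = 3              // nodes in the example cluster
)

// placement describes where a point lands: which shard group (time range),
// which partition (hash of the partition key), and which node owns that partition.
type placement struct {
	ShardGroup int64
	Partition  uint32
	Node       uint32
}

// place computes the placement of a point from its timestamp and partition key
// (by default the metric name plus its label set, serialized as a string).
func place(ts time.Time, partitionKey string) placement {
	group := ts.UnixNano() / int64(shardGroupDuration) // range partition on time

	h := fnv.New32a() // hash partition on the partition key
	h.Write([]byte(partitionKey))
	part := h.Sum32() % numPartitions

	return placement{
		ShardGroup: group,
		Partition:  part,
		Node:       part % numNodes, // assumed round-robin partition-to-node mapping
	}
}

func main() {
	p := place(time.Now(), `cpu_usage{host="n1",core="0"}`)
	fmt.Printf("shard group %d, partition %d, node %d\n", p.ShardGroup, p.Partition, p.Node)
}
```

With two shard groups, six partitions, and two partitions per node, each node ends up with the four shards mentioned above.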

Next, we will focus on some key technologies in openGemini's query execution that give PromQL queries a computing advantage.

• Computation pushdown: push computing tasks down to the data-source nodes whenever possible, which reduces network transmission and uses the data-source nodes' compute resources to balance the computing load (see the first sketch after this list).

• Hierarchical parallel computing: the query is decomposed at the top level into multiple parallel subtasks; each subtask executes part of the query independently, and the results are finally collected and aggregated (see the second sketch after this list).

• Vectorized batch computing: data is organized into ordered vectors or arrays and processed in batches using SIMD (Single Instruction, Multiple Data), improving data-processing efficiency (see the third sketch after this list).
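
The first sketch below illustrates the idea behind computation pushdown under simplified assumptions: instead of shipping raw points to the coordinator, each data-source node computes a partial aggregate locally, and only the small partial results cross the network. The interface and function names are hypothetical.

```go
package pushdown

// partial is a partial aggregation result computed locally on a data-source node.
type partial struct {
	Sum   float64
	Count int64
}

// StoreNode is a hypothetical data-source node that can evaluate an aggregation
// over its own shards and return only the partial result.
type StoreNode interface {
	PartialSum(metric string, startNs, endNs int64) partial
}

// pushedDownAvg sends the aggregation to every node and merges the partials,
// so only a handful of numbers cross the network instead of raw time-series points.
func pushedDownAvg(nodes []StoreNode, metric string, startNs, endNs int64) float64 {
	var sum float64
	var count int64
	for _, n := range nodes {
		p := n.PartialSum(metric, startNs, endNs)
		sum += p.Sum
		count += p.Count
	}
	if count == 0 {
		return 0
	}
	return sum / float64(count)
}
```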
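
The second sketch shows hierarchical parallel computing in the same simplified setting: the query's time range is split into subtasks, each subtask runs concurrently, and the partial results are aggregated at the top. The structure is an assumption for illustration, not openGemini's actual scheduler.

```go
package parallel

import "sync"

// subQuery is a hypothetical function that evaluates one slice of the query,
// e.g. one shard group, and returns a partial sum.
type subQuery func(startNs, endNs int64) float64

// parallelSum splits [startNs, endNs) into n subranges, evaluates them
// concurrently, and aggregates the partial results.
func parallelSum(run subQuery, startNs, endNs int64, n int) float64 {
	partials := make([]float64, n)
	step := (endNs - startNs) / int64(n)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			lo := startNs + int64(i)*step
			hi := lo + step
			if i == n-1 {
				hi = endNs // last subtask picks up any remainder
			}
			partials[i] = run(lo, hi)
		}(i)
	}
	wg.Wait()

	var total float64
	for _, p := range partials {
		total += p
	}
	return total
}
```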
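
The third sketch gives a minimal view of the batch/columnar idea behind vectorized execution: values sit in contiguous arrays and are processed in fixed-size chunks inside a tight loop, which is the access pattern that SIMD hardware and auto-vectorizing compilers can exploit. The batch size and function name are illustrative; a production engine may use hand-tuned kernels instead.

```go
package vectorized

const batchSize = 1024 // assumed batch width

// sumBatches sums a column of values chunk by chunk. Operating on contiguous
// float64 slices keeps the inner loop branch-free and cache-friendly, the
// layout that SIMD (Single Instruction, Multiple Data) execution relies on.
func sumBatches(values []float64) float64 {
	var total float64
	for off := 0; off < len(values); off += batchSize {
		end := off + batchSize
		if end > len(values) {
			end = len(values)
		}
		var s float64
		for _, v := range values[off:end] { // tight per-batch loop
			s += v
		}
		total += s
	}
	return total
}
```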