Pisanix v0.3.0 is Released, Introducing Data Sharding Strategy!

Data sharding is an effective solution to deal with massive data storage and computation.

Pisanix, a database mesh solution sponsored by SphereEx, now provides data sharding governance capability based on the underlying database - allowing users to scale out computing and storage.

Starting from v0.3.0, Pisanix will gradually support data sharding, with this release supporting single database sharding.

1. Introducing data sharding

As shown in Figure 1 below, data sharding mainly consists of SQL parse, SQL rewriting, SQL route, SQL execution, and result merge.

[Figure 1: data sharding process]

Concepts:

- SQL Parse: When Pisa-Proxy receives a request, the SQL parser first parses the statement into an AST.
- SQL Rewriting: Pisa-Proxy rewrites the parsed statement according to the sharding rules to generate the real SQL statement to be executed.
- SQL Route: Pisa-Proxy routes the rewritten statement to the corresponding backend data source according to the sharding rules.
- SQL Execution: Pisa-Proxy pushes the rewritten statement down to the real backend database for execution.
- Result Merge: Pisa-Proxy merges the query results and returns them to the client.

Introducing SQL Rewriting

SQL rewriting is a critical module in data sharding. Pisa-Proxy needs to rewrite the current SQL statement according to the sharding rules to generate the real SQL statement to be executed. SQL rewriting can be of the following types:

Identifier Rewriting

Identifiers to be rewritten include table names, index names, and schema names.

Table name rewriting means the process of finding the location of a logical table in the original SQL and rewriting it to a real table. Table name rewriting is a typical scenario that requires parsing of SQL. For example, if the logical SQL is:

SELECT order_id FROM order.t_order WHERE order_id = 1;

Suppose that the shard key of this table is order_id, the number of shards is two, and the sharding algorithm is mod. With order_id = 1, the row belongs to shard 1 (1 mod 2 = 1), so the rewritten SQL statement is as follows:

SELECT order_id FROM order.t_order_00001 WHERE order_id = 1;
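The rewrite above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses mod sharding and the five-digit suffix described in this article, and it swaps in a whole-word regular expression for the AST-based rewriting that Pisa-Proxy performs. The function names are hypothetical and are not part of Pisanix.

```python
import re


def shard_index(key_value: int, sharding_count: int) -> int:
    """Mod sharding: the shard index is the key value modulo the shard count."""
    return key_value % sharding_count


def actual_table_name(logical_table: str, index: int) -> str:
    """Append the shard index, zero-padded to five digits."""
    return f"{logical_table}_{index:05d}"


def rewrite_table_name(sql: str, logical_table: str,
                       key_value: int, sharding_count: int) -> str:
    """Replace whole-word occurrences of the logical table with the actual table.

    A regex stands in for real SQL parsing here; it is not safe for
    arbitrary SQL (string literals, comments, aliases, and so on).
    """
    actual = actual_table_name(logical_table,
                               shard_index(key_value, sharding_count))
    return re.sub(rf"\b{re.escape(logical_table)}\b", actual, sql)


logical_sql = "SELECT order_id FROM order.t_order WHERE order_id = 1;"
print(rewrite_table_name(logical_sql, "t_order", key_value=1, sharding_count=2))
# SELECT order_id FROM order.t_order_00001 WHERE order_id = 1;
```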

The following figure shows the data query process:

[Figure 2: data query process]

Taking data writing as an example, the data insertion process is as follows:

[Figure 3: data insertion process]

Note: When SQL rewriting computes the real table name, it automatically appends the table index according to the sharding rule. The naming rule is <table name>_<index>, where the index is zero-padded to five digits. For example, the t_order table is rewritten to t_order_00000. You therefore need to create the tables with the corresponding names to match your business scenario.
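To see which tables the note above asks you to create, the names can be enumerated from the logical table name and the shard count. The following is a minimal sketch; the helper name and the t_order_template table used in the generated DDL are hypothetical and only illustrate one way to create the tables in MySQL.

```python
from typing import List


def actual_table_names(logical_table: str, sharding_count: int) -> List[str]:
    """List every real table name for a logical table: <name>_<5-digit index>."""
    return [f"{logical_table}_{i:05d}" for i in range(sharding_count)]


# Two shards, as in the t_order example in this article.
for name in actual_table_names("t_order", 2):
    # t_order_template is a hypothetical table holding the desired schema.
    print(f"CREATE TABLE IF NOT EXISTS {name} LIKE t_order_template;")
```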

Column Supplement Rewriting

Columns need to be supplemented in a query statement in two cases. In the first case, Pisa-Proxy needs data during the result merge that the original SQL does not return.

This mainly applies to GROUP BY and ORDER BY. Merging the results requires grouping and ordering rows by the GROUP BY and ORDER BY items, so if the original SQL does not include those items in its select list, it must be rewritten.

For example, consider the following SQL statement:

SELECT order_id, user_id FROM t_order ORDER BY user_id;

Since user_id is used for sorting, its values are needed during the result merge. The statement above already selects user_id, so no column needs to be added and rewriting the identifier is enough. Now consider a statement that does not select user_id:

SELECT order_id FROM t_order ORDER BY user_id;

This statement also sorts by user_id, but user_id is not in the select list, so the column must be supplemented. The rewritten SQL is as follows:

SELECT order_id, user_id AS USER_ID_ORDER_BY_DERIVED_00000 FROM t_order_00000 ORDER BY user_id;
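The following sketch shows why the derived column is needed at merge time. It assumes each shard returns rows already sorted by user_id, merges them on the derived column, and then drops that column before returning rows to the client. The row values and the merge_ordered helper are made up for illustration and do not reflect Pisa-Proxy's actual merge code.

```python
import heapq

# Rows returned by each shard for the rewritten statement, as
# (order_id, USER_ID_ORDER_BY_DERIVED_00000), sorted by user_id within a shard.
shard_0 = [(10, 3), (14, 5)]
shard_1 = [(11, 1), (13, 4), (15, 9)]


def merge_ordered(shard_results, sort_col, keep_cols):
    """Merge per-shard sorted rows on sort_col, then keep only keep_cols."""
    merged = heapq.merge(*shard_results, key=lambda row: row[sort_col])
    return [tuple(row[i] for i in keep_cols) for row in merged]


# The client asked only for order_id, so the derived column is stripped.
print(merge_ordered([shard_0, shard_1], sort_col=1, keep_cols=[0]))
# [(11,), (10,), (13,), (14,), (15,)]
```

Without the supplemented column, the merge step would have no user_id values to compare across shards.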

The second case of column supplement is the use of AVG aggregate functions.

In distributed scenarios, computing the average as (avg1 + avg2 + avg3) / 3 is incorrect unless every shard holds the same number of rows; it should be computed as (sum1 + sum2 + sum3) / (count1 + count2 + count3).

SQL containing AVG is therefore rewritten to use SUM and COUNT, and the average is recalculated when the results are merged. For example:

SELECT AVG(price) FROM t_order WHERE user_id = 1;

The rewritten SQL is as follows:

SELECT COUNT(price) AS AVG_DERIVED_COUNT_00000, SUM(price) AS AVG_DERIVED_SUM_00000 FROM t_order_00000 WHERE user_id = 1;
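A small worked example makes the difference concrete. The sketch below assumes each shard returns one (COUNT, SUM) row for the rewritten statement; the values and the merge_avg helper are made up for illustration.

```python
# One (AVG_DERIVED_COUNT_00000, AVG_DERIVED_SUM_00000) row per shard.
# Illustrative values: shard 0 has 1 matching row, shard 1 has 3.
shard_rows = [(1, 100.0), (3, 60.0)]


def merge_avg(rows):
    """Recompute AVG as total SUM / total COUNT; None when no rows matched."""
    total_count = sum(count for count, _ in rows)
    if total_count == 0:
        return None  # mirrors SQL AVG returning NULL for an empty set
    total_sum = sum(s for _, s in rows if s is not None)
    return total_sum / total_count


# Averaging the per-shard averages weights each shard equally, which is wrong
# when the shards hold different numbers of rows.
naive = sum(s / c for c, s in shard_rows) / len(shard_rows)

print(merge_avg(shard_rows))  # 40.0
print(naive)                  # 60.0
```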

Configuration Description

This release supports querying, inserting, updating, and deleting data with single database sharding based on a single shard key. The configuration items are as follows:

[Table: configuration items]

Note: Broadcast tables, binding tables, sub-queries, table sharding, expression-based distribution rules, distributed transactions, and cross-database JOIN will be supported gradually in later versions.

Taking the scenario in Figure 2 and Figure 3 as an example, its corresponding CRD configuration is as follows:

# Declare a VirtualDatabase as a logical database
apiVersion: core.database-mesh.io/v1alpha1
kind: VirtualDatabase
metadata:
  name: test
  namespace: default
spec:
  services:
  - databaseMySQL:
      db: test
      host: 127.0.0.1
      password: "root"
      port: 3306
      user: root
    name: mysql
    trafficStrategy: test
    dataShard: test
---
# Declare a TrafficStrategy for the proxy
apiVersion: core.database-mesh.io/v1alpha1
kind: TrafficStrategy
metadata:
  name: test
  namespace: default
spec:
  loadBalance:
    simpleLoadBalance:
      kind: random
  selector:
    matchLabels:
      source: test
---
# Declare a DataShard with the sharding rules
apiVersion: core.database-mesh.io/v1alpha1
kind: DataShard
metadata:
  name: test
  namespace: default
  labels:
    source: test
spec:
  rules:
  - tableName: "t_order"
    tableStrategy:
      tableShardingAlgorithmName: "mod"
      tableShardingColumn: "id"
      shardingCount: 2
    actualDatanodes:
      valueSource:
        nodes:
        - value: "ds001"
---
# Declare a DatabaseEndpoint for the physical database
apiVersion: core.database-mesh.io/v1alpha1
kind: DatabaseEndpoint
metadata:
  labels:
    source: test
  name: ds001
  namespace: default
spec:
  database:
    MySQL:
      db: test
      host: mysql.default
      password: root
      port: 3306
      user: root

2. Pisanix v0.3.0 Overview

This release has been made possible thanks to 102 merged PRs by the following contributors:

[Image: Pisanix v0.3.0 contributors]

New Features

Pisa-Controller
- Support DataShard CRD #326
Pisa-Proxy
- Support single database sharding #338

Enhancements

Pisa-Controller
- Optimized Sidecar injection #234
- Support generic static read/write splitting rule #251
Pisa-Proxy
- Support SHOW STATUS parser #254
- Introduce CloudWatch Sinker for later audit events #258
- Support CREATE TABLESPACE parser #259
- Support CREATE SERVER #261
- Front MySQL protocol uses tokio codec #263
- Support dynamic read/write splitting Monitor switch #264
- Refactor MySQL server runtime #274
- Experimental SIMD in SQL Parser #278

Fixes

Pisa-Proxy
- Fix Apple M1 Macbook build failure in some cases #260
- Fix malformed SQL format in some cases #291
- Fix abnormal SQL parsing issues #297
- Fix unexpected AST in some cases #301
- Fix Pisa-Proxy crash without receiving VirtualDatabase configuration #319

Others

Golang-SDK
- Move CRD and kubernetes client to github.com/database-mesh/golang-sdk #247
- Add DataShard CRD and update VirtualDatabase with DataShard as new spec member in Golang-SDK
Docs
- Add sharding doc: https://www.pisanix.io/docs/Features/sharding
- Optimized microservices-demo deployment instruction #318
- Fork and optimize the code and configuration of github.com/microservices-demo/microservices-demo for a better demo. See: github.com/database-mesh/microservices-demo
Charts
- Update controller and proxy images to v0.3.0

Known Issues

- SHOW DATABASES and SHOW TABLES are not supported
- Observability is not supported in sharding mode

3. Community

Pisanix is an open source implementation of Database Mesh. Written in Rust, it provides unified database governance and aims to deliver four major development and application experiences: local database, unified configuration management, multi-protocol support, and cloud-native architecture.

The community is currently collecting more user scenarios and applications. If you've started or are planning to test Pisanix, please record it in the issue below.

The community will prioritize the development and optimization of features related to real scenarios.

[Image: who's using Pisanix]

【Download Link】

The Pisanix community organizes an online meeting every two weeks. Here are the details to join:

Mailing list
Biweekly meeting - English community (from February 16, 2022), on Wednesday 9:00 AM PST
Biweekly meeting - Chinese community (from April 27, 2022), on Wednesday 9:00 PM GMT+8
Slack
Meeting minutes

You're welcome to join us!