At least one Windows service is paired with each web service API to act as a scheduler. The scheduler triggers the execution of automated asynchronous processes via calls to a SOAP one-way version of the web service API front controller. The state of the automated processes is tracked in mirrored database tables within the System database, and the state data is relevant only within a single location. However, the logic of the processes themselves can span locations or perform integration work with remote external systems, etc.
The processes are small and fine-grained by design, preferring short, frequent micro-batches of incremental work rather than long-running crawls through massive amounts of data all at once. Simple processes with little or no state to track only require a single record in the Systemwork table. Complex processes (such as replication) have their own separate state tables, joined one-to-one with a corresponding Systemwork schedule record.
Many Windows services compete to claim a batch of Systemwork records, each of which acts as a token granting a given Windows service the “right” to execute the process. The Windows service uses ADO.NET to query the System database directly. Contention for the Systemwork records is resolved by the database engine, and from that point forward there is no contention within the automated processing, thanks to the thread’s ownership of the token.
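The claim pattern above can be sketched as a single atomic UPDATE, letting the database engine arbitrate between competing services. This is an illustrative sketch only: the table and column names are invented, and SQLite stands in for the actual ADO.NET access to the System database.

```python
import sqlite3

# Hypothetical Systemwork schema; the real table holds schedule state.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Systemwork (
    id INTEGER PRIMARY KEY,
    process TEXT NOT NULL,
    claimed_by TEXT,            -- NULL means the token is unclaimed
    due_at INTEGER NOT NULL)""")
conn.executemany("INSERT INTO Systemwork (process, due_at) VALUES (?, 0)",
                 [("replication",), ("verification",), ("cleansing",)])

def claim_batch(conn, worker, batch_size):
    """Atomically claim up to batch_size unclaimed, due Systemwork records.

    A single UPDATE statement lets the database engine resolve the race
    between competing services; whichever service's UPDATE runs first
    owns the tokens, and no further contention is possible."""
    with conn:  # one transaction
        conn.execute("""
            UPDATE Systemwork SET claimed_by = ?
            WHERE claimed_by IS NULL AND due_at <= 0
              AND id IN (SELECT id FROM Systemwork
                         WHERE claimed_by IS NULL AND due_at <= 0
                         ORDER BY id LIMIT ?)""", (worker, batch_size))
    return conn.execute(
        "SELECT id, process FROM Systemwork WHERE claimed_by = ? ORDER BY id",
        (worker,)).fetchall()

a = claim_batch(conn, "serviceA", 2)  # serviceA owns two tokens
b = claim_batch(conn, "serviceB", 2)  # serviceB gets the remaining one
print(a, b)
```

In a production engine the same effect would come from an UPDATE with appropriate locking hints; the key point is that the claim is a single statement, not a read-then-write.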
By default, the asynchronous API methods are executed assuming that each micro-batch must be processed sequentially. The internal concrete methods are executed synchronously, the process state info is updated after the method call has returned, and the schedule record is updated as a final step. This is necessary for processes such as replication, which act as an incremental, forward-scrolling cursor through the data of the source table. For other processes that are more self-contained (and therefore have no shared mutable state to maintain), the internal API method can execute the internal concrete method asynchronously, with much better performance.
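The two execution modes can be sketched as follows. This is a simplified illustration (function and parameter names are invented): cursor-based processes run their micro-batches strictly in order, while self-contained processes can fan out.

```python
from concurrent.futures import ThreadPoolExecutor

def run_micro_batches(batches, sequential):
    """Execute a list of micro-batch callables in one of two modes."""
    if sequential:
        # Cursor-based processes (e.g. replication): each batch runs
        # synchronously, and state is updated only after the previous
        # call has returned, preserving cursor ordering.
        return [work() for work in batches]
    # Self-contained processes share no mutable state, so their
    # batches can execute concurrently for better throughput.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(work) for work in batches]
        return [f.result() for f in futures]

batches = [lambda i=i: i * i for i in range(4)]
seq = run_micro_batches(batches, sequential=True)    # [0, 1, 4, 9]
conc = run_micro_batches(batches, sequential=False)  # same results, run in parallel
print(seq, conc)
```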
The scheduled interval of the Systemwork records, along with the configurable size of each micro-batch to be processed, acts to throttle the workload added to each processing node (remember that a processing node is also handling the external API method calls of client applications). Ultimately, a mechanism similar to the one used by the TCP protocol to manage throughput can be used to balance the competing workloads of event-driven and automated processing (this logic has not yet been implemented, but all the necessary features are already in place).
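One plausible shape for that TCP-style mechanism is additive-increase/multiplicative-decrease (AIMD) applied to the micro-batch size. To be clear, nothing below exists in the platform yet; the function, thresholds, and overload signal are all invented for illustration.

```python
def next_batch_size(current, overloaded, lo=1, hi=1000, step=10):
    """Adjust the micro-batch size the way TCP adjusts its window:
    grow slowly while the node keeps up with external API traffic,
    and cut sharply when the node reports overload."""
    if overloaded:
        return max(lo, current // 2)   # multiplicative decrease
    return min(hi, current + step)     # additive increase

size = 100
size = next_batch_size(size, overloaded=False)  # 110: node is keeping up
size = next_batch_size(size, overloaded=True)   # 55: back off quickly
print(size)
```

Because the schedule interval and batch size are already configurable per Systemwork record, a feedback loop like this could be layered on without schema changes.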
An important fact about Distribware is that it is designed for the high IOPS, high throughput and low latency of SSDs. It can be implemented using mechanical disk storage, but only for small-scale systems. Any system that requires decent performance and scalability must use SSDs for the database servers at a minimum. I know I mentioned this in other parts of the documentation, but it is worth repeating.
As mentioned elsewhere in the documentation, the Distribware platform is designed for modularity and easy extensibility. Custom systems are built using the platform by adding new custom API classes, plus new code libraries and database schemas used by the new API methods. The same message-processing pipeline is used by the new APIs just as it is used by the core APIs.
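The extensibility model can be sketched as a front controller dispatching to a registry of API methods. This is a hypothetical sketch only: the registry, decorator, and method names are invented, and the real platform does this in C# with its own conventions.

```python
API_REGISTRY = {}

def api_method(name):
    """Register a callable under an API method name (invented helper)."""
    def register(fn):
        API_REGISTRY[name] = fn
        return fn
    return register

def front_controller(message):
    """Single entry point: look up the named API method and invoke it.
    Core and custom methods travel through the same pipeline."""
    handler = API_REGISTRY[message["method"]]
    return handler(message["payload"])

@api_method("core.echo")               # stand-in for a core API method
def echo(payload):
    return payload

@api_method("custom.double")           # a newly added custom API method
def double(payload):
    return payload * 2

print(front_controller({"method": "custom.double", "payload": 21}))  # 42
```

The point of the pattern is that adding a custom system never touches the pipeline itself, only the registry of methods it dispatches to.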
Many of the automated processes are based on an incremental, forward-scrolling cursor mechanism that crawls through the data, performing various types of work as it goes. Large volumes of data can be broken up into discrete ranges, each with its own separate process crawling through the data either forward or backward, in parallel with the other processes, each running on its own segment of the data.
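The cursor mechanism can be sketched as keyset pagination over a key range. The names below are invented and the in-memory key set stands in for SQL against the source tables; the real processes persist the last key processed in their state tables so each micro-batch resumes where the previous one stopped.

```python
def next_batch(keys, last_key, batch_size):
    """Database equivalent: SELECT key ... WHERE key > ? ORDER BY key LIMIT ?"""
    return sorted(k for k in keys if k > last_key)[:batch_size]

def crawl_segment(keys, lo, hi, batch_size):
    """Crawl one segment (lo, hi] in small batches; other segments can
    be crawled in parallel by their own separate processes."""
    processed, last_key = [], lo
    while True:
        batch = [k for k in next_batch(keys, last_key, batch_size) if k <= hi]
        if not batch:
            return processed
        processed.extend(batch)
        last_key = batch[-1]   # the real process persists this in its state table

keys = range(1, 11)
first = crawl_segment(keys, lo=0, hi=5, batch_size=2)    # [1, 2, 3, 4, 5]
second = crawl_segment(keys, lo=5, hi=10, batch_size=2)  # [6, 7, 8, 9, 10]
print(first, second)
```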
- Replication: the core replication process has more in common with a real-time ETL process (minus the transform) than it does with true replication. It is described in the documentation as a real-time, incremental backup – but it also makes sense to look at it as an extract from the source and a load into the destination. This is why it is so easy to convert the replication process into a full ETL process by just adding some transformation logic in the middle.
- Verification: the continuous scan of replicated tables looking for gaps, duplicates and inconsistent values between each source and each destination.
- ETL: these processes are based on the replication logic, with a few modifications. First, the source placeholder field is configurable rather than fixed by convention. Second, transformation logic is added after the extract to match the structure of the destination. In many cases it is possible to perform bi-directional ETL as a type of system integration. If it is possible to add several fields to the destination schema, then the verification logic can also be run to ensure the integrity of the data from the source to the destination.
- Cleansing: data cleansing is usually a process of standardizing the values of various data fields/elements, etc. It can be added as part of a transformation, or run as its own separate process. The primary choice to make when cleansing is whether to overwrite the non-standard values, or to preserve them and standardize the values in a parallel set of data.
- Tagging/Categorization: this process creates new metadata linked to the main body of data. Subsequent processes can use the tag or category values when performing analyses, data pre-processing, etc.
- Counts/Grouping: similar to tagging and categorization, but instead of creating new metadata linked to the source, it creates counts and/or aggregates from the source data.
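The verification process in the list above can be sketched as a scan comparing one key range between a source and a destination. All names are invented, and in-memory key/value pairs stand in for the actual table scans.

```python
from collections import Counter

def verify(source, destination):
    """Report gaps (keys in source but missing at the destination),
    duplicate keys at the destination, and keys whose values disagree
    between source and destination."""
    src = dict(source)
    dupes = sorted(k for k, n in Counter(k for k, _ in destination).items() if n > 1)
    dst = dict(destination)
    gaps = sorted(src.keys() - dst.keys())
    mismatches = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return gaps, dupes, mismatches

source = [(1, "a"), (2, "b"), (3, "c")]
destination = [(1, "a"), (1, "a"), (3, "x")]   # 2 missing, 1 duplicated, 3 altered
print(verify(source, destination))  # ([2], [1], [3])
```

Because the scan is itself cursor-based, it can run continuously in small micro-batches alongside the replication it is checking.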
There is no real limit to the types of automated processing that can be performed by adding new API classes and methods to a custom system. If the logic can be written in C#, then it can usually be written as an API method and executed automatically.