17. The streaming data warehouse
Assess

The need to respond quickly to customer insights has driven increasing adoption of event-driven architectures and stream processing. Frameworks such as Spark, Flink or Kafka Streams offer a paradigm in which simple event consumers and producers can cooperate in complex networks to deliver real-time insights. But this programming style takes time and effort to master, and when implemented as single-point applications, it lacks interoperability. Making stream processing work universally at large scale can require a significant engineering investment. Now, a new crop of tools is emerging that offers the benefits of stream processing to a wider, established group of developers who are comfortable using SQL to implement analytics. Standardizing on SQL as the universal streaming language lowers the barrier to implementing streaming data applications. Tools like ksqlDB and Materialize help transform these separate applications into unified platforms. Taken together, a collection of SQL-based streaming applications across an enterprise might constitute a streaming data warehouse.

18. TinyML
Assess

Until recently, executing a machine-learning (ML) model was seen as computationally expensive and, in some cases, required special-purpose hardware. While creating the models still broadly sits within this classification, they can be created in a way that allows them to run on small, low-cost, low-power devices. This technique, called TinyML, has opened up the possibility of running ML models in situations many might assume infeasible. For example, on battery-powered devices or in disconnected environments with limited or patchy connectivity, the model can run locally without prohibitive cost. If you've been considering using ML but thought it unrealistic because of compute or network constraints, then this technique is worth assessing.

19. Azure Data Factory for orchestration
Hold

For organizations using Azure as their primary cloud provider, Azure Data Factory is currently the default for orchestrating data-processing pipelines. It supports data ingestion, copying data to and from different storage types on prem or on Azure, and executing transformation logic. Although we've had adequate experience with Azure Data Factory for simple migrations of data stores from on prem to the cloud, we discourage its use for orchestrating complex data-processing pipelines and workflows. We've had some success with Azure Data Factory when it's used primarily to move data between systems. For more complex data pipelines, it still has its challenges, including poor debuggability and error reporting; limited observability, because Azure Data Factory's logging capabilities don't integrate with other products such as Azure Data Lake Storage or Databricks, making it difficult to get end-to-end observability in place; and data source-triggering mechanisms that are available only in certain regions. At this time, we encourage using other open-source orchestration tools (e.g., Airflow) for complex data pipelines and limiting Azure Data Factory to data copying or snapshotting. Our teams continue to use Data Factory to move and extract data, but for larger operations we recommend other, more well-rounded workflow tools.
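
As a rough illustration of blip 17, the sketch below submits two SQL statements to a ksqlDB server's REST API from Python: the first declares a stream over an existing Kafka topic, the second derives a continuously maintained aggregate table. The server URL, topic and column names are assumptions made for the example, not details from the blip.

```python
import requests

# Assumed local ksqlDB server; host, topic and column names are illustrative only.
KSQLDB_URL = "http://localhost:8088/ksql"

STATEMENTS = [
    # Declare a stream over an existing Kafka topic.
    """
    CREATE STREAM orders (order_id VARCHAR, region VARCHAR, amount DOUBLE)
      WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
    """,
    # A persistent query that keeps an aggregate continuously up to date.
    """
    CREATE TABLE revenue_by_region AS
      SELECT region, SUM(amount) AS total_revenue
      FROM orders
      GROUP BY region;
    """,
]

for statement in STATEMENTS:
    response = requests.post(
        KSQLDB_URL,
        headers={"Content-Type": "application/vnd.ksql.v1+json"},
        json={"ksql": statement, "streamsProperties": {}},
    )
    response.raise_for_status()
    print(response.json())
```

Because the logic lives in SQL rather than in bespoke consumer code, the same pattern can be repeated by many teams, which is what allows a collection of such queries to add up to a streaming data warehouse.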
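For blip 18, one common TinyML workflow is to train a model in a standard framework and then shrink it for constrained hardware. The sketch below uses TensorFlow Lite's post-training quantization; the toy architecture and file name are placeholders, and in practice the resulting file is often embedded into firmware (for example, via TensorFlow Lite for Microcontrollers).

```python
import tensorflow as tf

# Stand-in for an already-trained Keras model (e.g., a small image classifier).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(96, 96, 1)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Convert to TensorFlow Lite with default optimizations (post-training quantization),
# shrinking the model so it can run on small, low-cost, low-power devices.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)

print(f"Converted model size: {len(tflite_model)} bytes")
```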
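For blip 19, the recommendation is to use open-source orchestration tools such as Airflow for complex pipelines. A minimal Airflow 2.x DAG is sketched below; the DAG id, task names and callables are hypothetical, and the real transformation logic would live inside the pipeline steps.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder pipeline steps; real implementations would call storage and compute services.
def ingest_raw_data(**_):
    print("copy source data into the data lake")

def run_transformations(**_):
    print("execute transformation logic")

with DAG(
    dag_id="example_complex_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_raw_data)
    transform = PythonOperator(task_id="transform", python_callable=run_transformations)

    # Explicit task dependencies give the debuggability and observability the blip calls for.
    ingest >> transform
```

Keeping Azure Data Factory for the copy or snapshot steps while a DAG like this owns the overall workflow matches the split the blip recommends.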
