
# Pentaho Data Integration Scaling How-To
Pentaho Data Integration (PDI), codenamed Kettle, is an extract, transform, and load (ETL) tool: it consists of a core data integration engine and GUI applications that allow the user to define data integration jobs and transformations. It supports deployment on single-node computers as well as in the cloud or on a cluster, and this clustering capability provides horizontal scaling, which improves performance when you need to normalize or parse large volumes of data. In the whitepaper "Pentaho Data Integration: Scaling Out Large Data Volume Processing in the Cloud or on Premise," Bayon benchmarked exactly this kind of workload using three datasets (50, 100, and 300 GB) prepared with the TPC-H data generator; the whitepaper is well worth reading for anyone considering adopting PDI. On the operations side, Nis Christian Carstensen talked about how to run Pentaho in a Kubernetes cloud at the Pentaho User Meeting in March, and Diethard Steiner, a passionate Pentaho user, will go deeper into the technical details at the Pentaho Community Meeting in Bologna, where his presentation will cover scaling Pentaho Server with Kubernetes.
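The sections that follow focus on the scaling machinery inside PDI itself, and the sketches use the Kettle Java API throughout. As a baseline, here is a minimal sketch of loading and running a transformation from Java; the classes come from the engine's org.pentaho.di packages, while the .ktr file name is a placeholder:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class RunTransformation {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();                           // boot the Kettle engine and its plugins
        TransMeta meta = new TransMeta("my_transform.ktr"); // placeholder .ktr designed in Spoon
        Trans trans = new Trans(meta);
        trans.execute(null);        // starts every step of the transformation as its own thread
        trans.waitUntilFinished();  // block until the last step has processed its last row
        if (trans.getErrors() > 0) {
            throw new IllegalStateException("Transformation finished with errors");
        }
    }
}
```

The detail that matters for scaling hides inside execute(): every step runs as a separate thread, with rows streaming between steps through in-memory buffers, so even a plain transformation is already a parallel pipeline.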

In this chapter (Chapter 16, "Parallelization, Clustering, and Partitioning"), we unravel the secrets behind making your transformations and jobs scale up and out. Scaling up means making the most of a single server with multiple CPU cores; scaling out means using the resources of multiple machines and having them operate in parallel. The first part of the chapter deals with the parallelism inside a transformation and the various ways to make use of it to scale up. Then we explain how to make your transformations scale out on a cluster of slave servers. Finally, we cover the finer points of Kettle partitioning and how it can help you parallelize your work even further. Both of these approaches are part of ETL subsystem #31, the Parallelizing/Pipelining System.
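To make the scale-up part concrete: since each step is its own thread, the simplest way to use more cores is to start several copies of the step that does the heavy lifting; incoming rows are then distributed among the copies (round-robin by default). In Spoon this is the step's "Change number of copies to start" option; the sketch below does the same through the Java API, with hypothetical file and step names:

```java
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;
import org.pentaho.di.trans.step.StepMeta;

public class ScaleUp {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta meta = new TransMeta("my_transform.ktr");  // placeholder file name

        // Run the CPU-bound step with one copy per available core;
        // each copy is a separate thread receiving a share of the rows.
        StepMeta heavy = meta.findStep("Calculate values");  // hypothetical step name
        heavy.setCopies(Runtime.getRuntime().availableProcessors());

        Trans trans = new Trans(meta);
        trans.execute(null);
        trans.waitUntilFinished();
    }
}
```

Extra copies only pay off while that step is the bottleneck; the steps feeding it and consuming from it still have to keep up, which is exactly the pipelining trade-off subsystem #31 is concerned with.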

When you have a lot of data to process, it is important to be able to use all the computing resources available to you. Whether you have a single personal computer or hundreds of large servers at your disposal, you want to make Kettle use all available resources to get results in an acceptable timeframe. The same holds beyond the ETL engine: the Pentaho Business Analytics server can also effectively scale out to a cluster.
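Scaling out in Kettle is built around slave servers: machines running Carte, the lightweight web server shipped with PDI that accepts and executes work remotely. The sketch below sends a transformation to a single slave, assuming a Carte instance is already listening at the given address with Carte's default cluster/cluster credentials; host, port, and file name are placeholders, and the sendToSlaveServer call shown is from the newer (5.x and later) API:

```java
import org.pentaho.di.cluster.SlaveServer;
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransExecutionConfiguration;
import org.pentaho.di.trans.TransMeta;

public class RunOnSlave {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();
        TransMeta meta = new TransMeta("my_transform.ktr");  // placeholder file name

        // A Carte slave server assumed to be running on this host and port.
        SlaveServer slave = new SlaveServer("slave1", "192.168.1.10", "8081",
                "cluster", "cluster");

        TransExecutionConfiguration config = new TransExecutionConfiguration();
        config.setRemoteServer(slave);
        config.setExecutingRemotely(true);
        config.setExecutingLocally(false);

        // Ship the transformation to the slave and start it there; the returned
        // id identifies the remote execution so it can be monitored via Carte.
        String carteObjectId = Trans.sendToSlaveServer(meta, config, null, null);
        System.out.println("Started on slave, Carte object id: " + carteObjectId);
    }
}
```

A full cluster run goes one step further: you define a cluster schema with one master and several slaves, mark steps in the transformation as clustered, and let Kettle split the row stream across the slaves, optionally guided by a partitioning schema.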
