Real-time data integration using Change Data Capture
Data can be extracted from a DBMS in many ways: with SQL queries, with table dumps, or through an application that sits on top of the database. These approaches are suitable in some scenarios, but the question to ask is: can they really deliver data in near real-time? I doubt it. The high computational cost of processing large amounts of data and the time needed to transfer it make these approaches too slow.
An alternative solution - Change Data Capture
The market is full of vendors, such as Oracle, Attunity, IBM, and Informatica, that offer real-time data integration solutions with a wide range of functionality and options. Most of them aim to achieve real-time data delivery using Change Data Capture (CDC), a mechanism based on identifying, capturing, and delivering only the changes made to operational/transactional data systems.
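To make the idea of delivering only the changes concrete, here is a minimal sketch in Python. It is not any vendor's API: the `ChangeEvent` type, the `apply_changes` function, and the sample rows are all hypothetical, and the target table is modeled as a plain dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical shape, for illustration only)."""
    op: str               # "insert", "update" or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


def apply_changes(target: dict, events: list) -> dict:
    """Replay captured changes, in order, against a target copy of the table."""
    for event in events:
        if event.op == "delete":
            target.pop(event.key, None)
        elif event.op in ("insert", "update"):
            target[event.key] = dict(event.row)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return target


# The target already holds an earlier copy of the source table.
target = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Bob", "city": "Paris"},
}

# Only these three changes travel to the target, not the whole table.
events = [
    ChangeEvent("update", 1, {"name": "Ada", "city": "Cambridge"}),
    ChangeEvent("delete", 2, None),
    ChangeEvent("insert", 3, {"name": "Eve", "city": "Oslo"}),
]

apply_changes(target, events)
```

The events must be replayed in the order they were captured. In this sketch the list order stands in for the commit order that a real capture process would take from the source.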
CDC gives users access to the latest information, allowing proactive measures to be taken based on near real-time data. Other benefits of CDC are:
- increased system efficiency - only a small amount of data has to be processed (log files) and transferred (changed data),
- cost reduction - system and storage requirements are lower,
- high availability - the system does not pause during data transfer,
- non-invasiveness - changes are captured from the database redo log files, so no modifications are required on the source database.
Any other ways?
CDC has some disadvantages. For example, to achieve “real” real-time data integration, changes have to be captured as part of a transaction, which adds overhead to the source database at capture time. CDC is also not the only option: real-time data integration can be achieved with other methods, such as data federation or middleware technologies that connect applications. Even so, CDC appears to be the best fit in many scenarios that require near real-time data. It is not a one-size-fits-all solution, though, and the question that needs to be answered is: how does the theoretical performance compare to that seen in deployed infrastructures?
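The cost of capturing changes as part of a transaction can be illustrated with a small sketch. It uses triggers in an in-memory SQLite database through Python's standard `sqlite3` module. This is an assumption chosen for illustration, not the log-based capture that the products mentioned earlier rely on, and the `customers` and `customers_changes` tables are hypothetical. Every write to the source table performs an extra write to the change table inside the same transaction, which is where the overhead comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT);

-- Hypothetical change table filled by the triggers below.
CREATE TABLE customers_changes (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,
    op   TEXT NOT NULL,
    id   INTEGER NOT NULL,
    city TEXT
);

CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, city)
    VALUES ('insert', NEW.id, NEW.city);
END;

CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, id, city)
    VALUES ('update', NEW.id, NEW.city);
END;

CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, city)
    VALUES ('delete', OLD.id, NULL);
END;
""")

# Committed transaction: both changes are captured along with the data.
with conn:
    conn.execute("INSERT INTO customers (id, city) VALUES (1, 'London')")
    conn.execute("UPDATE customers SET city = 'Cambridge' WHERE id = 1")

# Failed transaction: the row and its captured change roll back together.
try:
    with conn:
        conn.execute("INSERT INTO customers (id, city) VALUES (2, 'Paris')")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

changes = conn.execute(
    "SELECT op, id, city FROM customers_changes ORDER BY seq"
).fetchall()
```

Because the change rows are written in the same transaction as the data, the change table never records work that was rolled back. The price is that every insert, update, or delete on the source table now does roughly twice the writing.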