Many downstream database and analytics systems need to be routinely refreshed
with processed versions of the latest changes from production. Typically the new data needs to go through an
ETL process to prepare it for downstream use in an Analytics Engine, Warehouse,
Data Mart, Reader Farm, etc.
There are several major benefits Delphix offers in this scenario.
First, Delphix maintains its own up-to-date copy of source data by
seamlessly synchronising with the source databases and retaining it in a secure
central data hub.
Second, all ETL processes can be run directly from virtual databases
(VDBs) provisioned by Delphix, completely offloading production without the
need for physical copies. You can create
as many VDBs as you want to run these processes in parallel.
Third, if you wish to keep pace with new transactions you can use Delphix Live Sources, which uses an Oracle standby database to keep the Delphix dSource copy continuously updated while also enabling reporting through Oracle Active Data Guard. If you are synchronising with a target Hadoop cluster you can run Apache Sqoop jobs against a virtual data warehouse that is constantly being updated from the physical data warehouse. You can run your ETL directly on the changed data stored in the virtual data warehouse and import only the processed changes into your Hadoop cluster.
In addition, you can use Delphix to help identify deltas
between two versions of a source database by creating two VDBs based on two
point-in-time snapshots.
If the two VDBs (snap1 and snap2) are based on an Oracle source
database, first you create a database link connecting to snap1 from snap2.
On snap2 you create tables to store:
1. INSERTS, rows that have been added since the old version (snap1)
2. UPDATES, rows that have been modified between the two versions
3. DELETES, rows that no longer exist in the new version (snap2)
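As a minimal sketch, assuming the link is created on snap2 and that the credentials and TNS alias (report_user, SNAP1_TNS) are placeholders, the database link might look like this:
/* on snap2: database link back to the older snapshot */
create database link snap1
  connect to report_user identified by report_pwd
  using 'SNAP1_TNS';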
You can
identify INSERTS, UPDATES and DELETES between the two versions using the MINUS
set operator between the corresponding tables in the two VDBs.
For example,
to find and populate a local table delta_inserts with new rows, you could use
something like:
/* create an empty table with the correct structure
   (my_table is a placeholder for the source table name) */
create table delta_inserts as
  select * from my_table where 1 = 0;

alter table delta_inserts nologging;

/* populate the table with new rows only */
insert /*+ append */ into delta_inserts
with new_rows as
(
  select c_id from my_table        /* latest version, local on snap2 */
  minus
  select c_id from my_table@snap1  /* old version, via the database link */
)
/* fetch the entire row from the latest version */
select latest.*
from my_table latest,
     new_rows delta
where latest.c_id = delta.c_id;
The SQL is
similar for UPDATES and DELETES. However,
for UPDATES we need to compare the entire row in the MINUS operation since any
column value might have changed. For
DELETES we only need to store the key or id that we will use to identify the rows to
delete during the apply process.
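As a sketch, reusing the same placeholder names (my_table, c_id) from the example above and assuming the table structure is identical in both versions, delta_updates and delta_deletes could be built like this:
/* delta_updates: full-row comparison, then keep only rows whose key
   already existed in the old version (new keys are inserts, not updates) */
create table delta_updates as
  select * from my_table where 1 = 0;

insert /*+ append */ into delta_updates
with changed_rows as
(
  select * from my_table            /* latest version, local on snap2 */
  minus
  select * from my_table@snap1      /* old version, via the database link */
)
select c.*
from changed_rows c
where c.c_id in (select c_id from my_table@snap1);

/* delta_deletes: keys present in the old version but missing from the new one */
create table delta_deletes as
  select c_id from my_table@snap1
  minus
  select c_id from my_table;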
We now have
two tables containing copies of changed rows and one with a list of rows to be
deleted.
We can
run our ETL processing on only the changed rows held in delta_inserts and
delta_updates, making this process substantially more efficient than running
against the full dataset.
The final
phase is to implement an apply process to merge the processed deltas with the
target database. The options for
applying the changes will depend on the nature of the changes and the type of
target database, which might be relational, Hadoop (HDFS), etc.
For Hadoop
you can use Apache Sqoop to perform incremental imports and merges. For relational databases there are more
options.
The apply
process would usually, as sketched below:
1. delete all the rows in the target table having the keys saved in delta_deletes
2. insert or append into the target table the processed rows saved in delta_inserts
3. merge or update into the target table the processed rows saved in delta_updates
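As a sketch, assuming the processed delta tables have been made accessible to a relational target (for example via a database link or a load step), and that target_table, c_id and the column names are placeholders:
/* 1. remove rows that were deleted at source */
delete from target_table
where c_id in (select c_id from delta_deletes);

/* 2. append the processed new rows */
insert /*+ append */ into target_table
select * from delta_inserts;

/* 3. merge the processed modified rows */
merge into target_table t
using delta_updates u
on (t.c_id = u.c_id)
when matched then
  update set t.col1 = u.col1, t.col2 = u.col2;

commit;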
If there
are foreign key dependencies then you might need to disable them before
the apply process and re-enable them afterwards.
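For example, assuming a hypothetical constraint name fk_target_parent on the placeholder target table:
alter table target_table disable constraint fk_target_parent;
/* ... run the apply process ... */
alter table target_table enable constraint fk_target_parent;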
In summary,
Delphix is ideal for provisioning data sets for ETL processing. Delphix Virtual Databases (VDBs) can also
underpin the process of identifying
changes between two versions of the same database. ETL jobs can then operate
against only the changed rows, and once processed these changes can be used
to refresh the target database using a simple apply process.
Learn more at Delphix Agile Data Management