Comparing Large Resultsets

Overview

CompareData allows you to compare the data of very large resultsets by partitioning a large comparison resultset and/or comparing the resultset data using an inline comparison.

Default Comparison Method

By default, when you compare the data of two resultsets, all the resultset data is retrieved from each data source, and then each resultset data is sorted at the client machine using the comparison key columns that are set up for the comparison.

After both resultset data are retrieved and sorted, the data is compared at the client machine.

In order to use this method, both resultset data must fit in the client machine memory.

Note that even though ODBC drivers may convert data from the data source server character set to the character set used at the client machine, comparison resultsets can be safely assumed to be in the correct sorting order for comparing at the client machine because the resultset data is sorted at the client machine and not at the data source DBMS server.

Inline Comparison Method

Using an inline comparison, the resultset data is sorted at the DBMS server typically using an SQL Order By clause and then a fixed-size block of rows is retrieved from each data source and then the rows of two blocks are compared at the client machine and only rows that satisfy a row limit criteria are kept at the client machine. After two blocks of rows are compared, then the next two blocks of rows are retrieved and compared and so forth until the end of the resultsets is reached.

CompareData allows you to set up a limit on the number of rows to keep depending on their comparison result. For example, you may want to keep only the first 100 rows that have no differences, all rows that are new, and all the rows that have differences. Or you may set a limit of zero for all rows in which case no rows are kept at the client machine after a block rows is compared.

Since you can set limits on the number of rows to keep, you can compare very large resultsets as long as the rows you decide to keep will fit in the client machine memory.

A requirement for inline comparison is that since the data is sorted at the servers and compared at the client, both servers and the client must use compatible character sets and collation sequences. So that if 'A' is smaller than 'a' on one server, it is also smaller on the other server and also at the client machine where rows are compared.

If none of the resultset comparison keys are character type, then this is not an issue, otherwise, you may get incorrect comparison results if any of the comparison key columns are character type and the two servers and the client machine character sets/collation sequences are not compatible.

Partion Resultset

A table data comparison may be partioned into multiple comparisons by copy/pasting the table comparison into a user folder and then setting up a where clause for each partition. This requires that the data of the where clauses for all the paritionas cover all the data of the comparison without an overlap.

For example, the modulo (% or mod) operator may be used on a numeric primary key/indexed column to construct the partitions where clause.

When comparing/synchronizing the data of a partitions folder in its items view, each partition is considered a separate data comparison and one or more parititons may be compared/synchronized concurrently allowing a very large comparison to be parallized using the items view max-active property.

500M row comparison partitioned into 10 50M row partitions