The MapPoint Batch Geocoder is an open source transformation component for SQL Server Integration Services that performs batch geocoding of address information directly within the SSIS pipeline using the geocoding capabilities of MapPoint's Customer Data Services. The transformation outputs both the original input data and the geodetic spatial data using the new Spatial Data Types of SQL Server 2008.
The transformation leverages the geocoding capabilities of MapPoint's Customer Data Services. Address information within the pipeline is buffered and then uploaded to MapPoint CDS in batches. Once MapPoint geocodes the addresses, the data is downloaded and reintegrated back into the SSIS pipeline. During reintegration, spatial data is joined back with the original input data and continued through the pipeline.
The output columns include both the original input columns as well as two additional columns:
- GeoPoint - Geocoded address as a SqlGeography spatial data type.
- MatchCode - The geocoding accuracy MapPoint associates with each address.
- Geocodes addresses directly in the SSIS pipeline.
- Performs geocoding in batches as opposed to row-by-row.
- Leverages the geocoding capabilities of MapPoint's Customer Data Services.
- Outputs the SQL Server 2008 geodetic (vs. geometry) spatial data type.
- Ability to select which input columns contain address information (AddressLine, PostalCode, etc).
- Variable based configuration of MapPoint settings to allow environment specific configuration.
- The ability to select between the staging or production environments of MapPoint's CDS.
- Pass-thru of additional input columns of data (non address data) in the output pipeline.
- Allows stubbing of the geocoding operation (generation of random values for lat/lon instead of invoking the MapPoint web services) for disconnected testing scenarios.
- Web Service retry capability in case of minor network/Internet failures, for added resiliency.
- SQL Server 2008 Integration Services
- Visual Studio 2008 C# (for code modification and compilation)
- .NET Framework 3.5
- Compile the MapPoint Batch Geocoder transformation component.
- Copy the transformation component (MapPoint Batch Geocoder.dll) to the PipelineComponents folder for SSIS (i.e. "C:\Program Files\Microsoft SQL Server\100\DTS\PipelineComponents\").
- Add the assembly to the Global Assembly Cache (GAC). (see article 'How to: Install an Assembly into the Global Assembly Cache' - http://msdn.microsoft.com/en-us/library/dkkx7f79.aspx)
- On the Data Flow tab of SSIS, right click in the toolbox and select "Choose Items…".
- Select the "SSIS Data Flow Items" tab.
- Choose "MapPoint Batch Geocoder" from the list of components, and click OK.
The component should now be available from the list of Data Flow Transformations in the toolbox. For convenience, an installation package is also available which installs the runtime version. However, it does not add the component to Visual Studio.
The MapPoint Batch Geocoder works just like any other transformation component; once you add a Data Flow task to the Control flow design surgace, drag and drop the "MapPoint Batch Geocoder" from the toolbox and place it somewhere between a source and destination. After connecting the transformation to upstream (input) item and selecting the available input columns, configuration of the geocoder can begin.
Below is a list of the available configuration options:
- MapPoint UserName - The user name of the MapPoint account to use for geocoding.
- MapPoint Password - The password of the MapPoint account to use for geocoding.
- MapPoint DataSourceName - A fully qualified target data source name used when uploading data to MapPoint. All address data uploaded to MapPoint must be stored in a data source. Typically this data source is used to store address information by subject area and then referenced when rendering a map. For our purposes, we will use this data source temporarily for the life of the package instance, and discard once the geocoding completes. The data sources must be uniquely named to avoid collisions during processing when multiple instances of the batch geocoder execute simultaneously (i.e. MyAccount123456.123456.Offices and MyAccount123456.123456.Customers).
- MapPoint UseStaging - Set to true to perform the geocoding in MapPoint's staging environment.
- Source AddressLine - The input column to use as the AddressLine as required by MapPoint.
- Source EntityID - The input column to use as the EntityIDas required by MapPoint.
- Source PostalCode - The input column to use as the PostalCode as required by MapPoint.
- Source PrimaryCity - The input column to use as the PrimaryCity as required by MapPoint.
- Service Stub MapPoint Data - If true, will cause the transformation to avoid any Web Service calls and synthesis fake lat/lon values instead of invoking the service. Useful for testing purposes.
- Service Retry Attempts - Number of attempts to retry accessing MapPoint services in case of network failure.
- Service Retry Delay - The delay between retries (in seconds).
Log events are generated during geocoding and can be captured using standard SSIS logging. All events are reported under the OnInformation event.
Isolating the Geocoding Process
The possibility of connectivity failure increases anytime Internet-based services are involved, as is the case with this component’s dependency on the MapPoint’s CDS webservice. The ETL processes can be designed so that a geocoding failure does not cause the entire ETL process to fail, particularly if the ETL process is long and difficult/tedious to restart.
Instead of performing geocoding as an extra data flow step in an already complex package, a more resilient design is to have a separate package which performs the geocoding against records which have already been processed by ETL and landed in the target database. The package will only geocode records which have yet to be geocoded, indicated by a null or flag. Once the geocoding is complete, the package updates the existing rows and clears the flag.
Because of the design, the geocoding package can be executed out-of-band, which handles the case of a network failure or even a historical reload of data.
Due to the nature of how address information is entered by operational systems, it is quite common for this information to contain errors (formatting, spelling, typos, etc). However, the MapPoint Batch Geocoder will not perform any data cleansing of address information before submitting to MapPoint. Since error-prone address information will yield poor or incorrect geocoding results, consideration for adding data cleansing practices to ETL processes should be made.
The MatchCode value returned by MapPoint can be used to identify records that had an issue being geocoded. A decision should be made for each possible MatchCode value of whether the quality of the geocode is useful enough to be consumed by downstream systems and whether further address cleansing (i.e. manual intervention) is required.
Controlling the MapPoint Batch Size
For performance reasons, geocoding is performed in batch mode. The batch size is controlled by the SSIS buffer controls DefaultBufferMaxRows and DefaultBufferSize properties of a data flow task. Managing the batch size is useful for achieving the right balance between geocoding too much data at once (i.e. all is lost in case of failure, or MapPoint limitations), and overall process performance.
Because the MapPoint Batch Geocoder is based on the MapPoint CDS webservice, it must adhere to the rules and constraints dictated by the Customer Data Services. Although most of these restrictions are already accounted for in the transformation’s design, some may still require consideration. For instance:
- The combined size of all your data sources cannot be more than 2 gigabytes after being uploaded and uncompressed.
- Any single file for uploading cannot be more than 100 megabytes (compression is allowed).
See Formatting MapPoint Web Service Data Source Files for more information - http://msdn.microsoft.com/en-us/library/cc534849.aspx