This article is a bare bones introduction to SQL Server Integration Services (SSIS), with an emphasis on the SSIS development environment. SSIS is Microsoft’s full feature extract-transform-load (ETL) tool in SQL Server 2005, 2008, and R2. SSIS is used to extract data from any of a myriad of sources, such as SQL Server databases, flat files, Excel files, Oracle and DB2 databases, etc. It performs data manipulation and transformation on the data and delivers it to one or more data destinations.
While SSIS is part of the SQL Server suite of tools, and is very adept at handling SQL Server-to-SQL Server ETL operations, it easily handles many other source and destination you have a library for. Potential applications include scrubbing and exporting Excel data for output to network directories, or third party database ETL transformations (e.g., Oracle-to-Oracle data loads). Let’s create a new SSIS package and take a brief look at the major components and features of the SSIS development environment.
Visual Studio vs. BIDS
The SSIS development environment can be installed during a SQL Server installation or independently, i.e., you don’t have to have the SQL Server engine installed to develop SSIS packages. The SSIS development environment consists of the Business Intelligence Development Studio (BIDS), a full-fledged Visual Studio installation with components specific to SSIS and other elements comprising SQL Server business intelligence. If you already have Visual Studio 2008 installed on your computer, the SSIS installation will install a separate BIDS application and add components to your existing Visual Studio. You can develop in either environment, as they are identical.
Like all files in Visual Studio, all SSIS packages are part of a VS project, which is in turn part of a solution. To create a new project and a new SSIS package:
- Start BIDS or Visual Studio
- Select File > New > Project
- Select “Integration Services Project” from Business Intelligence Projects
- Name the project and select the location. Click OK.
The new project will display a new, blank package. Your most valuable windows are:
- Toolbox – Contains the elements you can add to your package. The contents of the Toolbox change depending on what main window tab you’re in.
- Main Window – The development window. Each tab shows a different level or focus of the development. Today we are only interested in two of those tabs.
- Control Flow tab – Contains control flow elements. Control flow is anything that moves in the package – anything that causes you to go from one step to another. Examples include moving or deleting a file, creating indexes, etc.
- Data Flow tab – Data flow is anything that physically moves data. All sources, destinations, and transformations are included in data flow.
- Connection Managers – In SSIS, there is a single connection manager for each connection. When you create a data source, you assign it a connection manager. After that, if the connection manager changes, all data sources that use it are affected as well. That wasn’t the case in DTS (the ETL tool prior to SQL Server 2005); if you had to change a connection, you had to go to each and every task and change it there individually.
- Solution Explorer – This displays the Package elements – the package itself, data sources, packages, etc.
- Properties – This is a context sensitive menu that displays the properties for whatever is currently selected, from the package itself down to individual components.
In the Solution Explorer, right-click and rename your package to something descriptive, like “MyFirstPackage.dtsx”. Answer Yes to the popup window “Do you want to rename the package object as well?” You can also add existing packages to your project, or import packages from the file system or from a SQL Server.
Tabs: Control Flow vs. Data Flow
In DTS (the SQL Server 2000 predecessor to SSIS), control flow and data flow were the same thing, which caused a lot of problems – especially with sequencing tasks. In SSIS, control flow and data flow are viewed and controlled separately, using two tabbed windows. Control flow is anything that controls or moves in the package – anything that causes you to go from one step to another. A few examples include loops (like For and For Each), scripting and FTP, send mail, creating indexes, or data profiling. Notice that “Data Flow Task” is an element in the Control Flow’s Toolbox pane. Data flow is anything that physically moves and manipulates the data in the SSIS pipeline (as opposed to running a SQL command that runs on the SQL Server). Data sources, destinations, and transformations in the data flow allow you to:
- pull data from one or many disparate sources
- convert data types
- clean up “dirty” data
- split or merge data from one or more sources
- and much more.
For more information on SSIS, reference:
- The ongoing MidnightDBA Ground Zero SSIS video course: Part 1, Part 2, Part 3
- Integration Services page at MSDN