15.12. Understanding Import

Documentation

VoltDB Home » Documentation » Using VoltDB

15.12. Understanding Import

Just as VoltDB can export data from selected streams to various targets, it supports importing data to selected tables from external sources. Import works in two ways:

  • One-time import of data using one of several data loading utilities VoltDB provides. These data loaders support multiple standard input protocols and can be run from any server, even remotely from the database itself.

  • Streaming import as part of the database server process. For data that is imported on an ongoing basis, use of the built-in import functionality ensures that import starts and stops with the database.

The following sections discuss these two approaches to data import.

15.12.1. One-Time Import Using Data Loading Utilities

Often, when migrating data from one database to another or when pre-loading a set of data into VoltDB as a starting point, you just want to perform the import once and then use the data natively within VoltDB. For these one-time uses, VoltDB provides separate data loader utilities that you can run once and then stop.

Each data loader supports a different source format. You can load data from text files — such as comma-separated value (CSV) files — using the csvloader utility. You can load data from another JDBC-compliant database using the jdbcloader utility. Or you can load data from a streaming message service with the Kafka loader utility, kafkaloader.

All of the data loaders operate in much the same way. For each utility you specify the source for the import and either a table that the data will be loaded into or a stored procedure that will be used to load the data. So, for example, to load records from a CSV file named staff.csv into the table EMPLOYEES, the command might be the following:

$ csvloader employees --file=staff.csv

If instead you are copying the data from a JDBC-compliant database, the command might look like this:

$ jdbcloader employees \
     --jdbcurl=jdbc:postgresql://remotesvr/corphr \
     --jdbctable=employees \
     --jdbcdriver=org.postgresql.Driver

Each utility has arguments unique to the data source (such as --jdbcurl) that allow you to properly configure and connect to the source. See the description of each utility in Appendix D, VoltDB CLI Commands for details.

15.12.2. Streaming Import Using Built-in Import Features

If importing data is an ongoing business process, rather than a one-time event, then it is desirable to make it an integral part of the database system. This can be done by building a custom application to push data into VoltDB using one of its standard APIs, such as the JDBC interface. Or you can take advantage of VoltDB's built-in import infrastructure.

The built-in importers work in much the same way as the data loading utilities, where incoming data is written into one or more database tables using an existing stored procedure. The difference is that the built-in importers start automatically whenever the database starts and stop when the database stops, making import an integral part of the database process.

You configure the built-in importers in the configure file the same way you configure export connections. Within the <import> element, you declare each import stream using separate <configuration> elements. Within the <configuration> tag you use attributes to specify the type and format of data being imported and whether the import configuration is enabled or not. Then enclosed within the <configuration> tags you use <property> elements to provide information required by the specific importer and/or formatter. For example:

<import>
  <configuration type="kafka" format="csv" enabled="true">
    <property name="brokers">kafkasvr:9092</property>
    <property name="topics">employees</property>
    <property name="procedure">EMPLOYEE.insert</property>
  </configuration>
</import>

When the database starts, the import infrastructure starts any enabled configurations. If you are importing multiple streams to separate tables through separate procedures, you must include multiple configurations, even if they come from the same source. For example, the following configuration imports data from two Kafka topics from the same Kafka servers into separate VoltDB tables.

<import>
  <configuration type="kafka" enabled="true">
    <property name="brokers">kafkasvr:9092</property>
    <property name="topics">employees</property>
    <property name="procedure">EMPLOYEE.insert</property>
  </configuration>
  <configuration type="kafka" enabled="true">
    <property name="brokers">kafkasvr:9092</property>
    <property name="topics">managers</property>
    <property name="procedure">MANAGER.insert</property>
  </configuration>
</import>

VoltDB currently provides support for two types of import:

VoltDB also provides support for two import formats: comma-separated values (csv) and tab-separated values (tsv). Comma-separated values are the default format. So if you are using CSV-formatted input, you can leave out the format attribute, as in the preceding example.

The following sections describe each of the importers and the CSV/TSV formatter in more detail.