Apache Sqoop : A Data Transfer Tool for Hadoop

Updated: Aug 24, 2019


What is Sqoop? :

Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

  • Sqoop imports data from external structured datastores into HDFS or related systems such as Hive and HBase.

  • Sqoop exports data from Hadoop to external structured datastores such as relational databases and enterprise data warehouses.

  • Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB.

  • Sqoop runs its import and export operations as MapReduce jobs, which gives it both parallelism and fault tolerance (see the sketch after this list).
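
As a minimal sketch of that parallelism (using the same placeholder connection details as the commands later in this article, plus a hypothetical numeric id column to split on), an import can be spread across four map tasks like this:

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TableName --split-by id --num-mappers 4 --target-dir /user/data/test

Each map task imports one slice of the table, and a failed task is simply re-run by the MapReduce framework, which is where the fault tolerance comes from.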


Sqoop Architecture :

  • Sqoop uses a connector-based architecture, which supports plugins that provide connectivity to new external systems.

  • Sqoop provides a command-line interface to end users and can also be accessed through its Java API.

  • Sqoop can directly import data from an RDBMS into HDFS, Hive, and HBase, but you cannot export data from Hive or HBase directly to an RDBMS. Instead, you perform all exports from HDFS: when you export Hive tables from your Hadoop cluster to an RDBMS, you point Sqoop at the HDFS directories that store the Hive tables (by default, /user/hive/warehouse), as in the sketch below.
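
As a hedged illustration of that pattern, the export below points --export-dir at a Hive table's warehouse directory. The hive_orders directory and connection details are placeholders, and the command assumes the Hive table still uses Hive's default ^A field delimiter, hence --input-fields-terminated-by '\001':

$ sqoop export --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TARGET_TABLE_NAME --export-dir /user/hive/warehouse/hive_orders --input-fields-terminated-by '\001'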

Sqoop Import :

Sqoop import command:

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TableName --target-dir /user/data/test
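
Note that passing --password on the command line leaves the credential in the shell history; Sqoop can instead prompt for it interactively with -P or read it from a protected file with --password-file, for example:

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME -P --table TableName --target-dir /user/data/test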



How Sqoop Imports the Data:

When you issue an import command, this is how the data import process proceeds:

  1. The user runs the Sqoop import command.

  2. Sqoop connects to the relational database and fetches the table’s metadata, such as its columns and their data types. It generates and compiles a Java class that represents the table’s records, and then connects to the Hadoop cluster.

  3. Sqoop generates a MapReduce job.

  4. Sqoop executes the MapReduce job in the Hadoop cluster.

  5. MapReduce performs the import using map tasks only (no reducers), guided by the metadata gathered in step 2, and writes the output as a set of files in HDFS.

  • The imported data is saved in a directory on HDFS named after the table being imported; the user can specify an alternative directory (--target-dir) where the files should be placed. By default these files contain comma-delimited fields, with new lines separating records. The user can override this layout by explicitly specifying the field separator and record terminator characters, and Sqoop also supports other file formats for imported data, as sketched below.
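
As a hedged sketch of those options (connection details, table name, and target directories are placeholders), the first command imports with tab-separated fields and explicit line terminators, and the second imports the same table as Avro data files instead of delimited text:

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TableName --target-dir /user/data/test_tsv --fields-terminated-by '\t' --lines-terminated-by '\n'

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TableName --target-dir /user/data/test_avro --as-avrodatafile

Each map task writes one part-m-NNNNN file under the target directory, which can be inspected with hdfs dfs -ls /user/data/test_tsv.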

Sqoop Export :

Sqoop export command:

$ sqoop export --connect jdbc:mysql://SERVER/DB_NAME --table TARGET_TABLE_NAME --username USER_NAME --password PASSWORD --export-dir EXPORT_DIR



How Sqoop Exports the Data:

  • Sqoop first examines the database for metadata about the target table, and then transfers the data.

  • Sqoop divides the input dataset into splits and uses individual map tasks to push each split to the database.

  • Each map task performs its transfer over many transactions to ensure optimal throughput and minimal resource utilization.

  • Sqoop can be used to export that data back to RDBMS from HDFS or from a Hive table.

  • When exporting data back to an RDBMS, the target table must already exist in the target database (for example, a table created ahead of time in MySQL).

  • The rows of HDFS files, or the records of a Hive table, are given as input to the Sqoop command and become rows in the target table.

  • Those files are read and parsed into a set of records according to the user-specified delimiters, as in the sketch below.
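
As a hedged example of that parsing step (connection details, table name, and the tab delimiter are assumptions), the export below tells Sqoop that the files under /user/data/test use tab-separated fields:

$ sqoop export --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TARGET_TABLE_NAME --export-dir /user/data/test --input-fields-terminated-by '\t'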

Sqoop Is Flexible

Most aspects of the import, code generation, and export processes can be customized. You can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data. You can specify the file format used. Sqoop provides connectors for MySQL, PostgreSQL, Netezza, Oracle, SQL Server, and DB2. There is also a generic JDBC connector.
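
As a hedged illustration of a few of these knobs (all names are placeholders), the first command below restricts an import to two columns and a row range with --columns and --where, and the second falls back to the generic JDBC connector by supplying an explicit --driver class (com.example.jdbc.Driver and the otherdb URL are stand-ins, not real values):

$ sqoop import --connect jdbc:mysql://SERVER/DB_NAME --username USER_NAME --password PASSWORD --table TableName --columns "id,name" --where "id > 1000" --target-dir /user/data/test_subset

$ sqoop import --connect jdbc:otherdb://SERVER/DB_NAME --driver com.example.jdbc.Driver --username USER_NAME --password PASSWORD --table TableName --target-dir /user/data/test_generic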

About Data Science Authority

Data Science Authority is a company engaged in training, product development, and consulting in the field of Data Science and Artificial Intelligence. It is built and run by highly qualified professionals with more than 10 years of working experience in Data Science. DSA’s vision is to inculcate data thinking into individuals irrespective of domain, sector, or profession, and to drive innovation using Artificial Intelligence.
