Import relational databases to Hadoop using Sqoop






Hello there! This time we will discuss how to import data into Hadoop from an RDBMS, using Sqoop as the import mechanism.

What’s Sqoop?

It’s an open-source project of the Apache Software Foundation, designed to transfer data between relational databases and Hadoop. It allows users to import data to a target location inside Hadoop and to export data from Hadoop back into a database as well.
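For completeness, an export runs in the opposite direction, pushing delimited files from HDFS back into an existing database table. A minimal sketch (the host, user, and HDFS directory here are illustrative placeholders, not values from this tutorial):

# Sketch only: sqoop export pushes delimited HDFS files into an existing MySQL table.
# The host, credentials, and export directory below are illustrative placeholders.
sqoop export \
    --connect jdbc:mysql://<mysql-host>:3306/books \
    --username hadoop \
    --password hadoop \
    --table authors \
    --export-dir /user/<your-user>/export/ \
    --input-fields-terminated-by '|' \
    -m 1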
If you would rather not use Sqoop to transfer data, there are alternatives such as Spark, but they come with their own drawbacks; for example, Spark does not handle some complex database data types well.

Before running the import commands, please make sure you have Java, Hadoop, and Sqoop installed on your machine.
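A quick way to confirm everything is in place is to print the versions from the terminal:

# Each of these should print a version rather than a "command not found" error.
java -version
hadoop version
sqoop version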


(Image source: severalnines.com)



When working with Hive on top of the Hadoop file system, there are two types of tables you will use in the import process.

1. External tables

These tables are created with the EXTERNAL keyword, and each table points to the HDFS location specified by the LOCATION keyword.


Below is a sample query for creating an external table.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.ext_authors (
    ID int,
    NAME varchar(20),
    BOOK_NAME varchar(20),
    DATETIME TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/thilina_08838/source/ideabiz/authors/data';

We need to make sure that the folder specified with the LOCATION keyword contains data file(s) matching the format declared in the CREATE statement. The data will not be removed when you drop an external table; Hive only drops the table schema.
In other words, Hive does not manage the files themselves; it manages only the table schema and simply knows which external folder the table maps to.
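You can verify this behaviour yourself: dropping the external table leaves the underlying files in place (the path is the one used in the CREATE statement above):

# Drop the external table, then confirm the data files are still in HDFS.
hive -e "DROP TABLE IF EXISTS db1.ext_authors;"
hadoop fs -ls /user/thilina_08838/source/ideabiz/authors/data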

2. Internal tables

These tables behave like normal database tables where data can be stored and queried. When you drop an internal (managed) table, the data stored in it is deleted as well and is lost. Hence we need to be careful with internal tables, as a single DROP command can destroy the whole data set.

CREATE TABLE IF NOT EXISTS db1.authors (
    ID int,
    NAME varchar(20),
    BOOK_NAME varchar(20),
    DATETIME TIMESTAMP)
PARTITIONED BY (time_key INT);
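Because no LOCATION is specified here, Hive keeps this table's data under its own warehouse directory, and a DROP TABLE deletes those files along with the schema. Assuming the default warehouse location, the data would be visible at a path like this:

# Assumes the default warehouse directory /user/hive/warehouse;
# adjust if your cluster overrides hive.metastore.warehouse.dir.
hadoop fs -ls /user/hive/warehouse/db1.db/authors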


(Image source: edureka.com)

Use EXTERNAL tables when:
  1. The data is also used outside of Hive. For example, the data files are read and processed by an existing program.
  2. Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.

Use INTERNAL tables when:

  1. The data is temporary.
  2. You want Hive to completely manage the lifecycle of the table and its data.

Okay, now we are ready. Here’s the workflow:

1. Create the connection between the Hadoop cluster and the RDBMS through the JDBC connector (you can sanity-check this as shown right after this list).
2. Import the data into an external folder in HDFS.
3. Load the data into the external table.
4. Load the data from the external table into the internal table.
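Step 1 is easy to verify on its own. Running sqoop list-tables from a node where Sqoop is installed confirms that the JDBC connection details are correct before any data is moved (these are the same connection details used by the import command further below):

# Connectivity check: list the tables visible through this JDBC connection.
sqoop list-tables \
    --connect jdbc:mysql://172.24.23.145:3306/books \
    --username hadoop \
    --password hadoop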

First, log in to the master node of your Hadoop cluster via the terminal. Then launch the Sqoop job:





sqoop import \
    --connect jdbc:mysql://172.24.23.145:3306/books \
    --username hadoop \
    --password hadoop \
    --append \
    --table authors \
    --as-textfile \
    --fields-terminated-by '|' \
    --target-dir /user/thilina_03238/source/ \
    -m 1

The command above launches MapReduce jobs that pull the data from our MySQL database and write it to HDFS in parallel. You can confirm that the data has reached the target directory with the command below.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/thilina_03238/source/part-m-*


Alright! When creating the external table, you need to point it at the same directory you pushed the data to from the RDBMS.

CREATE EXTERNAL TABLE IF NOT EXISTS db1.ext_authors (
    ID int,
    NAME varchar(20),
    BOOK_NAME varchar(20),
    DATETIME TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/thilina_03238/source/';

You can run the query below to confirm that the data is in the external table.

SELECT * FROM ext_authors LIMIT 100
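To go a step beyond eyeballing rows, you can compare row counts on both sides; sqoop eval runs a query directly against the source database, and the two counts should match (assuming the import has run only once, since --append adds rows on every run):

# Row count on the MySQL side, executed through sqoop eval.
sqoop eval \
    --connect jdbc:mysql://172.24.23.145:3306/books \
    --username hadoop \
    --password hadoop \
    --query "SELECT COUNT(*) FROM authors"

# Row count on the Hive side, for comparison.
hive -e "SELECT COUNT(*) FROM db1.ext_authors;"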



Here’s the most important section: create the internal table and load the data from the external table. We define a partition key on the internal table because queries respond faster when they scan only the relevant slice of the data instead of searching the entire data set.
On the other hand, having too many partitions creates a large number of files and directories in HDFS, which becomes an overhead because the NameNode must keep all file system metadata in memory.

CREATE TABLE IF NOT EXISTS db1.authors (
    ID int,
    NAME varchar(20),
    BOOK_NAME varchar(20),
    DATETIME TIMESTAMP)
PARTITIONED BY (time_key INT);

Now we load the data from the external table into the internal table using the query below.

set hive.exec.dynamic.partition.mode=nonstrict;
insert into db1.authors partition(time_key)
select ID, NAME, BOOK_NAME, DATETIME,
       cast(from_unixtime(unix_timestamp(cast(DATETIME as string), 'yyyy-MM-dd HH:mm:ss'), 'yyyyMM') as int) as time_key
from db1.ext_authors;



set hive.exec.dynamic.partition.mode=strict means that whenever you populate a partitioned Hive table, at least one partition column must be given a static (fixed) value.

set hive.exec.dynamic.partition.mode=nonstrict means you don't need any static partition column; every partition value can be derived dynamically from the data being inserted.
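For comparison, in strict mode the same load would need at least one partition column spelled out explicitly. A minimal sketch of a static-partition insert (the 202101 value and the matching WHERE filter are only illustrative):

# Static partition insert: allowed even in strict mode because time_key is fixed explicitly.
# The 202101 value and the WHERE filter are only an illustration.
hive -e "
INSERT INTO db1.authors PARTITION (time_key=202101)
SELECT ID, NAME, BOOK_NAME, DATETIME
FROM db1.ext_authors
WHERE cast(from_unixtime(unix_timestamp(cast(DATETIME as string),'yyyy-MM-dd HH:mm:ss'),'yyyyMM') as int) = 202101;
"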

Phew! Now we have the data in the Hive table. You can query it as below:

SELECT * FROM authors LIMIT 100
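You can also check which partitions the dynamic insert created:

# Lists one row per time_key partition, e.g. time_key=202101.
hive -e "SHOW PARTITIONS db1.authors;"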

This is the basic way of importing data from an RDBMS to HDFS. But how can we automate this process? Let’s discuss the automation of the Sqoop import in the next post. :)



