Import relational databases to Hadoop using Sqoop
Hello there! This time we will discuss how to
import data into Hadoop from an RDBMS, using Sqoop as the import
mechanism.
What’s Sqoop?
It’s an open-source software product of the
Apache Software Foundation. The tool is designed to transfer data between
relational databases and Hadoop. It allows users to import data to a target
location inside Hadoop and to export data from Hadoop back to the database as well.
If you don’t want to use Sqoop to transfer
data, there are alternatives available such as Spark. But they come with their own
drawbacks; for example, Spark does not work well with complex data types.
Before running the commands to import data, please
make sure you have Java, Hadoop and Sqoop installed on your machine.
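A quick way to check that everything is in place (assuming the binaries are on your PATH) is to print the versions from the terminal:

$ java -version
$ hadoop version
$ sqoop version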
On the Hadoop side, there are
two types of Hive tables you will use in the process of importing data.
1. External tables
We create these tables using the EXTERNAL keyword,
and the table points to the HDFS location specified by the LOCATION
keyword.
Below is a sample query for creating an external
table.
CREATE EXTERNAL TABLE IF NOT EXISTS db1.ext_authors (
  ID int,
  NAME varchar(20),
  BOOK_NAME varchar(20),
  DATETIME TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/thilina_08838/source/ideabiz/authors/data';
We need to make sure that the folder specified
with the LOCATION keyword contains data file(s) that match
the format specified in the CREATE statement. The data will not be removed when
you drop an external table; Hive just drops the table schema.
Also, Hive does not manage the data files of an external table; it manages
only the table schema and simply knows which external folder the table points to.
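You can see this behaviour with a quick check (a minimal sketch, reusing the sample table and path above): drop the external table and then list the HDFS folder.

DROP TABLE IF EXISTS db1.ext_authors;

$ $HADOOP_HOME/bin/hadoop fs -ls /user/thilina_08838/source/ideabiz/authors/data

The data files should still be listed, because Hive removed only the table schema.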
2. Internal tables
These tables are similar to normal database
tables, where data can be stored and queried. When you drop an internal table, the
data stored in it gets deleted as well, so we need to be
careful while using internal tables: one DROP command can destroy the whole
data set.
CREATE TABLE IF NOT EXISTS db1.authors (
  ID int,
  NAME varchar(20),
  BOOK_NAME varchar(20),
  DATETIME TIMESTAMP)
PARTITIONED BY (time_key INT);
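By the way, if you are ever unsure whether a table is internal or external, DESCRIBE FORMATTED tells you (shown here against the sample db1.authors table):

DESCRIBE FORMATTED db1.authors;

Look for the Table Type field in the output: MANAGED_TABLE for internal tables, EXTERNAL_TABLE for external ones.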
We use EXTERNAL tables when:
- The data is also used outside of Hive. For example, the
data files are read and processed by an existing program.
- Data needs to remain in the underlying location even
after a DROP TABLE. This can apply if you are pointing multiple schemas
(tables or views) at a single data set or if you are iterating through
various possible schemas.
Use INTERNAL tables when:
- The data is temporary.
- You want Hive to completely manage the lifecycle of the
table and data.
Okay, now we are ready. Here’s the workflow:
1. Create the connection between the Hadoop cluster
and the RDBMS through a JDBC connector (a quick connectivity check is shown right after this list).
2. Import the data into the external folder in HDFS.
3. Load the data into the external table.
4. Load the data into the internal table from the
external table.
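Before launching the full import, it’s worth confirming that the JDBC connection in step 1 actually works. A minimal check, reusing the same connection details as the import command below:

sqoop list-tables \
--connect jdbc:mysql://172.24.23.145:3306/books \
--username=hadoop \
--password=hadoop

If the connection is fine, this prints the tables in the books database, including authors.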
First, log in to the master node of
your Hadoop cluster via the terminal. Then launch the Sqoop job:
sqoop import \
--connect jdbc:mysql://172.24.23.145:3306/books \
--username=hadoop \
--password=hadoop \
--append \
--table=authors \
--as-textfile \
--fields-terminated-by="|" \
--target-dir=/user/thilina_03238/source/ \
-m 1
The above command launches MapReduce jobs that pull
the data from our MySQL database and write it to HDFS in parallel. You
can confirm that the data has been pushed to the target directory with the command below.
$ $HADOOP_HOME/bin/hadoop fs -cat /user/thilina_03238/source/part-m-*
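You can also list the target directory to see the generated files (with -m 1 there is a single mapper, so you should see one output file, typically named something like part-m-00000):

$ $HADOOP_HOME/bin/hadoop fs -ls /user/thilina_03238/source/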
Alright! When you create the external table,
you need to point it at the same directory you pushed the data to from the RDBMS.
CREATE EXTERNAL TABLE IF NOT EXISTS db1.ext_authors (
  ID int,
  NAME varchar(20),
  BOOK_NAME varchar(20),
  DATETIME TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/thilina_03238/source/';
You can run the query below to confirm that the data is
in the external table.
SELECT * FROM db1.ext_authors LIMIT 100;
Here’s the most important section: creating the
internal table and importing data into it from the external table. We create a partition
key in the internal table because query response time is faster when Hive can process
a small part of the data instead of scanning the entire data
set.
On the other hand, having too many partitions in a table will
create a large number of files and directories in HDFS; since the NameNode must
keep all metadata for the file system in memory, this becomes an
overhead.
CREATE TABLE IF NOT EXISTS db1.authors (
  ID int,
  NAME varchar(20),
  BOOK_NAME varchar(20),
  DATETIME TIMESTAMP)
PARTITIONED BY (time_key INT);
Now we will load the data from the external table into the
internal table using the query below.
set hive.exec.dynamic.partition.mode=nonstrict;

insert into db1.authors partition (time_key)
select ID, NAME, BOOK_NAME, DATETIME,
  cast(from_unixtime(unix_timestamp(cast(DATETIME as string), 'yyyy-MM-dd HH:mm:ss'), 'yyyyMM') as int) as time_key
from db1.ext_authors;
set hive.exec.dynamic.partition.mode=strict
means that whenever you populate a partitioned Hive table, the insert must
have at least one static partition column.
set hive.exec.dynamic.partition.mode=nonstrict
means you don't need any static partition column; every partition column can be
determined dynamically from the data, which is what we do above. A strict-mode style
insert is shown below for comparison.
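For comparison, here is a minimal sketch of a strict-mode style insert, where the time_key value is supplied statically (202101 is just a made-up example value):

insert into db1.authors partition (time_key=202101)
select ID, NAME, BOOK_NAME, DATETIME
from db1.ext_authors
where cast(from_unixtime(unix_timestamp(cast(DATETIME as string), 'yyyy-MM-dd HH:mm:ss'), 'yyyyMM') as int) = 202101;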
Phew! Now we have the data in the Hive table. You
can query it as below:
SELECT * FROM db1.authors LIMIT 100;
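You can also check which partitions the dynamic-partition insert created:

SHOW PARTITIONS db1.authors;

Each line of the output corresponds to one time_key value, for example time_key=202101.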
This is the basic way of importing data from an
RDBMS to HDFS. But how can we automate this process? Let’s discuss the automation of the Sqoop import
in the next post. :)