Posts

Showing posts from December, 2017

sqoop incremental import in cloudera hadoop

In the last blog post, I described how we can import data from an RDBMS to HDFS using Sqoop. Now we will discuss how to do an incremental import through the Cloudera Hadoop user interface. If you know the basic functionality of Hadoop, this is a simple task! To perform an incremental import in Sqoop, you need to consider the ‘incremental’, ‘check-column’, and ‘last-value’ options. The following syntax is used for the incremental import:

--incremental <mode> --check-column <column name> --last-value <last check column value>

Cloudera Hadoop is a commercial distribution of Hadoop. I am using the Oozie workflow UI provided by Cloudera to import data. When you define workflows in the Oozie UI, you need to give the correct file path for the JDBC driver as well. If you haven't included the drivers yet, please make sure you put all of them in a folder that can be accessed by everyone. Log in to the Hue UI -> Workflows -> editors -> workflows
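To make the three incremental options concrete, here is a minimal Python sketch that assembles a Sqoop incremental import command as an argument list. The helper name build_incremental_import, the JDBC URL, the table name, the target directory, and the check column are all hypothetical placeholders and not values from this post. Only the Sqoop flags themselves (--connect, --table, --target-dir, --incremental, --check-column, --last-value) are real options.

```python
import shlex
from typing import List, Optional


def build_incremental_import(
    connect: str,
    table: str,
    target_dir: str,
    check_column: str,
    last_value,
    mode: str = "append",
    username: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for a Sqoop incremental import.

    This helper is illustrative only; it does not run Sqoop.
    """
    # Sqoop supports two incremental modes: append and lastmodified.
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")

    cmd = [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", mode,
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]
    if username:
        cmd += ["--username", username]
    return cmd


if __name__ == "__main__":
    # Hypothetical values: import rows whose order_id is greater than 1000.
    example = build_incremental_import(
        connect="jdbc:mysql://dbhost/sales",
        table="orders",
        target_dir="/user/demo/orders",
        check_column="order_id",
        last_value=1000,
    )
    # Print a shell-safe version of the command (no expected output shown).
    print(" ".join(shlex.quote(part) for part in example))
```

In append mode, Sqoop imports only rows whose check column is greater than the given last value, so the value passed on the next run should be the highest one imported so far.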

Import relational databases to hadoop using sqoop

Hello there, this time we will discuss how to import data into Hadoop from an RDBMS. We are using Sqoop as the import mechanism. What’s Sqoop? It’s an open-source software product of the Apache Software Foundation. The tool is designed to transfer data between relational databases and Hadoop. It allows users to import data to a target location inside Hadoop and to export data from Hadoop as well. If you do not want to use Sqoop to transfer data, there are alternatives available, such as Spark. But there are some disadvantages; for example, Spark did not work well for complex data types. Before running the commands to import data, please make sure you have installed Java, Hadoop, and Sqoop on your workstation.

Source: severalnines.com

When considering the Hadoop file system, there are two types of tables you need to use in the process of importing data. 1. External tables We do create these tab

Introduction to Hadoop

So, after a while :D This time, let’s discuss Hadoop. What is Hadoop? Hadoop is an open-source framework for storing data and running applications on clusters of commodity hardware, that is, machines that are relatively inexpensive and widely available. When we dig into Hadoop and look at its internals, we can identify a few core components:

1. Open-source data storage, or HDFS, which stands for Hadoop Distributed File System.
2. A processing API, which is called MapReduce.

In common deployments, Hadoop also includes more than 25 other projects or libraries. A few of the common names are HBase, Hive, Pig, and Oozie. Let’s discuss the Hadoop distributions. There are mainly 3 types:

100% open source - Apache Hadoop
Commercial - Cloudera, Hortonworks, MapR
Cloud - Microsoft Azure HDInsight, AWS

Most enterprises stay one to two full versions behind the currently released version of Hadoop. Because they consider the open source so