Introduction to Hadoop





So after a while :D This time, let’s discuss about the hadoop

What is hadoop?

Hadoop is an open-source framework for storing data and running applications on clusters of device components that are relatively inexpensive and widely available.
When dig in to the hadoop and check the internal process, we can identify few core components.

Those are
1. Open-source data storage or HDFS which stands for Hadoop Distributed File System.
2. Processing API which is called MapReduce.

Commonly in deployments hadoop does include more than 25 other projects or libraries. Few of the common names are HBase, Hive, Pig and Ozzie.

Let’s discuss about the hadoop distributions. There are mainly 3 types.

  1. 100% Open source - Apache hadoop
  2. Commercial - Cloudera, Hortonworks, MapR
  3. Cloud - Microsoft Azure HDInsight, AWS

Most of the enterprises are stay on one to two full versions behind the currently released version of the hadoop. Because they consider the open source software to be immature and not ready for use in a enterprise professional setting.

Now you might have a question. If there’s an open source framework is available, Why people do use commercial one? It’s because, they do wrap around some version of the open source distribution. Also they do provide additional tooling, monitoring and management along with other libraries.

When you're using a cloud distribution you can use an Amazon Distribution which implements the open source version of Hadoop.If needed you can use a commercial version that's implemented on the AWS cloud such as MapR on AWS. Also keep in mind that, not all commercial versions are available on all clouds. :)
That should be a consideration when you're selecting a cloud-based Hadoop distribution.

Why enterprises do use hadoop?

1. Faster - Parallel processing
It is using parallel data processing method, which implements using MapReduce processing algorithm. The processing is done in batch and it’s implemented on each of the server nodes as well. Hence this can result much faster overall processing of large amounts of data.

Below is the high level architecture diagram of hadoop.

Image Credit : OpenSource.com


2. Cheaper
Since hadoop is using commodity hardware you don’t need to spend much money to store the petabyte size data or more.

Why hadoop is a best solution for enterprises?

1. Enhance the market research - Since you have the power of analysing large amount of data, hadoop helps you to carry out detailed market research and enhancing your products and services based on those decisions

2. Transactional Analysis -  Relational databases as being the stores for your current transactions. What about your history of transactions? What if you can analyze the history of all transactions for all locations and you are.
If you are able to look at all transactions and then determine what customers purchased your products in a certain time period in similar locations. Again, behavioral data which is resulting in better business decisions.

3. Data access centralization - Centralizing conventional data often a challenge and blocked the enterprise from working as one team. But big data has solved this problem, by offering visibility of the data throughout the organization.


Organizations using hadoop.

  1. Amazon
  2. Facebook
  3. eBay
  4. Yahoo
  5. IBM and many more...

Comments

Popular posts from this blog

CAP THEOREM

Quality Assurance in Agile Software Development

Hash Functions