
\subsection{Hadoop:}
Hadoop \cite{white2012hadoop} is an open-source framework for distributed storage and data-intensive processing, originally developed at Yahoo!. It has two core components: the Hadoop Distributed File System (HDFS) and the MapReduce programming model \cite{dean2008mapreduce}. HDFS is a distributed file system that splits data into blocks and stores them on nodes throughout a cluster, maintaining a configurable number of replicas of each block. It provides a highly reliable, fault-tolerant, consistent, efficient, and cost-effective way to store large amounts of data. The MapReduce model consists of two key functions: the Mapper and the Reducer. Mappers process input splits in parallel through separate map tasks and send their sorted, shuffled outputs to the Reducers, which in turn group the intermediate records by key and process each group with a reduce task.
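As an illustration of the Mapper/Reducer pattern, the following is a minimal word-count sketch using the standard Hadoop Java MapReduce API; the class and variable names (\textit{WordCount}, \textit{TokenMapper}, \textit{SumReducer}) are our own choices for this example and not part of the framework.

\begin{verbatim}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: invoked once per input record of its split;
  // emits an intermediate (word, 1) pair for every token.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reducer: after the framework sorts and shuffles the map
  // outputs, receives all values for one key (one group) and
  // sums them into a final (word, count) pair.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context ctx) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new IntWritable(sum));
    }
  }
}
\end{verbatim}

The grouping step between the two functions is performed by the framework itself: every intermediate pair with the same key is routed to the same reduce task, which is what allows each Reducer to process one complete group.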

When a file is written to HDFS, it is divided into fixed-size blocks. The client first contacts the NameNode, which returns a list of DataNodes on which the blocks can be stored. The data blocks are then distributed across the Hadoop cluster. Figure \ref{fig.clusternode} shows the architecture of a Hadoop cluster node used for both computation and storage. The MapReduce engine (running inside a Java virtual machine) executes the user application. When the application reads or writes data, requests are passed through the Hadoop \textit{org.apache.hadoop.fs.FileSystem} class, which provides a standard interface to distributed file systems, including the default HDFS. An HDFS client is then responsible for retrieving the data by contacting a DataNode that holds the desired block. In the common case, the DataNode runs on the same node, so no external network traffic is necessary. The DataNode, also running inside a Java virtual machine, accesses the data stored on local disk using normal file I/O operations.
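To make the role of the \textit{org.apache.hadoop.fs.FileSystem} interface concrete, the following minimal sketch reads a file through it; \textit{FileSystem.get} returns whichever implementation the configuration names (HDFS when \textit{fs.defaultFS} points at a NameNode), and the path used here is hypothetical.

\begin{verbatim}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Loads cluster settings (core-site.xml etc.) from the classpath.
    Configuration conf = new Configuration();

    // Returns the configured FileSystem implementation; the
    // application code below is identical for HDFS or a local FS.
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical input path, for illustration only.
    Path path = new Path("/user/demo/input.txt");

    // fs.open() hands back a stream; behind it, the HDFS client
    // locates and contacts a DataNode holding each block.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
\end{verbatim}

Because the application only sees the abstract interface, the block lookup, replica selection, and local-versus-remote read decisions described above happen entirely inside the HDFS client.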
