Big data refers to the huge volume of data that cannot be stored and processed within a given time frame by traditional file systems. The next question that comes to mind is how big this data needs to be in order to be classified as big data. There is a lot of misconception around the term big data. We usually refer to data as big if its size is in gigabytes, terabytes, petabytes, exabytes or anything larger than this. This alone does not define big data. Even a small file can be referred to as big data, depending upon the context in which it is being used.

Let us take an example to make this clear. If we attach a 100 MB file to an email, we cannot do so, as email does not support an attachment of this size. Therefore, with respect to email, this 100 MB file can be referred to as big data. Similarly, if we want to process 1 TB of data in a given time frame, we cannot do this with a traditional system, since its resources are not sufficient to accomplish this task.

Various social sites such as Facebook, Twitter, Google+, LinkedIn and YouTube contain data in huge volumes. As the number of users on these sites grows, storing and processing this enormous data is becoming a challenging task. Storing this data is important for various firms to generate huge revenue, which is not possible with a traditional file system. This is where Hadoop comes into the picture.

Big data simply means the huge amount of structured, unstructured and semi-structured data that can be processed to extract information. Nowadays, massive amounts of data are produced because of the growth in technology and digitalization, and by a variety of sources, including business application transactions, videos, pictures, electronic mails, social media, and so on. The big data concept is used to process this data.

Structured data: data that has a proper format associated with it is known as structured data. For example, the data stored in database files.
Semi-structured data: data that does not follow a strict format but still carries some organization is known as semi-structured data. For example, the data stored in mail files.
Unstructured data: data that does not have any format associated with it is known as unstructured data. For example, image files, audio files and video files.

Big data is characterized by the 3 Vs, which are as follows [1]:
Volume: the amount of data generated, which is huge in size.
Velocity: the speed at which the data is generated.
Variety: the different kinds of data that are generated.

A. Challenges Faced by Big Data
There are two main challenges faced by big data [2]:
1. How to store and manage a huge volume of data efficiently.
2. How to process and extract valuable information from this huge volume of data within a given time frame.
These challenges led to the development of the Hadoop framework.

Hadoop is an open source framework developed by Doug Cutting in 2006 and managed by the Apache Software Foundation. Hadoop was named after a yellow toy elephant. Hadoop was designed to store and process huge volumes of data efficiently. The Hadoop framework comprises two main components:
HDFS: the Hadoop Distributed File System, which takes care of the storage of data within a Hadoop cluster.
MapReduce: which takes care of processing the data that is present in HDFS.

Now let us have a look at a Hadoop cluster. It contains two types of nodes: the master node and the slave node. The master node is responsible for running the name node and job tracker daemons. Here, node is the technical term used to denote a machine present in the cluster, and daemon is the technical term for the background processes running on a Linux machine. The slave node, on the other hand, is responsible for running the data node and the task tracker daemons.
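To make the roles of these daemons more concrete, the following is a minimal Java sketch of an HDFS client, assuming a cluster whose name node listens at the placeholder address hdfs://master:9000 and a hypothetical file path /user/demo/sample.txt. It uses the standard org.apache.hadoop.fs.FileSystem API: the client asks the name node for metadata about the file, while the file contents themselves are streamed to and from the data nodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the name node; "hdfs://master:9000" is a
        // placeholder for the master machine described above. The property
        // is fs.defaultFS on current releases (fs.default.name on old 1.x).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://master:9000");

        FileSystem fs = FileSystem.get(conf);

        // Writing a file: the client asks the name node for metadata
        // (which blocks go where), then streams the blocks to data nodes.
        Path file = new Path("/user/demo/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Reading follows the reverse route: metadata from the name node,
        // block contents from the data nodes that hold them.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}

The same put and get behaviour is also available from the command line through hadoop fs -put and hadoop fs -cat.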
The name node and data node are responsible for storing and managing the data, and are commonly referred to as storage nodes, whereas the job tracker and task tracker are responsible for processing and computing the data, and are commonly known as compute nodes. Normally, the name node and job tracker run on a single machine, whereas the data node and task tracker daemons run on different machines.

B. Features of Hadoop [3]
Cost effective system: it does not require any specialized hardware. It can simply be implemented on common machines, technically known as commodity hardware.
Large cluster of nodes: a Hadoop cluster can support a large number of nodes, which provides huge storage and processing capacity.
Parallel processing: a Hadoop cluster provides the ability to access and process data in parallel, which saves a lot of time.
Distributed data: Hadoop takes care of splitting and distributing the data across all nodes within a cluster. It also replicates the data over the entire cluster.
Automatic failover management: once AFM is configured on a cluster, the admin need not worry about a failed machine. Hadoop replicates the configuration: one copy of each piece of data is replicated to a node in the same rack, and Hadoop takes care of the internetworking between the two nodes.
Data locality optimization: this is one of the most powerful features of Hadoop and makes it very efficient. If a job requests huge data that resides at some other location, instead of moving the data, the code is sent to the machine where the data resides and is executed there, which saves a lot of bandwidth.
Heterogeneous cluster: the nodes or machines can be from different vendors and can run different flavors of operating systems.
Scalability: in Hadoop, adding or removing a machine does not affect the cluster. Even adding or removing hardware components of a machine does not affect it.

C. Hadoop Architecture
Hadoop comprises two components: HDFS and MapReduce. Hadoop splits big data into several chunks and stores the data on several nodes within a cluster, which significantly reduces the processing time. Hadoop replicates each part of the data onto other machines that are present within the cluster. The number of copies replicated depends on the replication factor. By default, the replication factor is 3. Therefore, in this case, there are 3 copies of each piece of data on 3 different machines.

Reference:
Mahajan, P., Gaba, G., & Chauhan, N. S. (2016). Big Data S. IITM Journal of Management and IT, 7(1), 89-.
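As a concrete illustration of the MapReduce component described in the architecture section above, here is a sketch of the classic word count job written against the org.apache.hadoop.mapreduce API. The input and output paths passed on the command line are placeholders; on a cluster, the job tracker schedules the map tasks on the nodes that hold the input chunks, which is the data locality optimization mentioned earlier.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the nodes holding the input chunks and emits
    // a (word, 1) pair for every word it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] and args[1] are HDFS input and output paths,
        // e.g. /user/demo/input and /user/demo/output (placeholders).
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with hadoop jar wordcount.jar WordCount followed by the hypothetical input and output paths.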