where developers meet development
Friday, October 18, 2019


The Future of Distributed Computing Rests with Hadoop


Moore's law has finally hit the wall: CPU clock speeds have stalled in recent years, and the industry is reacting with hardware that offers more cores and software that can leverage 'grids' of distributed computing resources. Further, the assimilation of computing into our daily lives is generating data at unprecedented rates. The amount of digital information churned out in 2011 is estimated to be 1,800 exabytes, ten times what was produced in 2006. The rising number of web applications serving millions of Internet users and dealing with petabytes of data, the advent of cheap storage capacity resulting in tremendous growth in data retention, and the availability of cheap resources to process that data have all reinforced the need for large-scale data processing.

Extracting information and intelligence from these data sets, commonly referred to as data analytics, is an important data-intensive application stemming from this huge corpus of data. Data analytics has proven useful in several scenarios: web data mining (for example, web indexing and search), extracting business intelligence (such as click-stream analysis to increase ad revenue), and processing data sets from scientific studies and simulations (for research in natural language processing, seismic simulation and scene completion). Analyzing these large volumes of data demands a highly scalable solution.

Google's MapReduce is one popular approach that enables data analytics by processing, in parallel, data partitioned among a large number of commodity machines. MapReduce can scale out to clusters of hundreds of commodity nodes as the data and processing demands grow, and can automatically handle failures in these large settings. Hadoop is a relatively new Java-based software framework that derives inspiration from Google's MapReduce and Google File System concepts.
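To make the model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases that MapReduce distributes across a cluster. All names below are illustrative; this is not Hadoop's API, just the idea behind it, applied to the classic word-count problem.

```java
import java.util.*;

// A minimal, single-process sketch of the map -> shuffle -> reduce model
// that MapReduce (and Hadoop) distribute across a cluster of machines.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by key, then sum each group.
    // In a real cluster the grouping happens across the network.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("the quick brown fox", "the lazy dog");
        // Prints each word mapped to its frequency across all input lines.
        System.out.println(reduce(map(input)));
    }
}
```

Because each map call touches only its own slice of the input, and each reduce group is independent, both phases parallelize naturally, which is exactly what lets MapReduce scale out by adding nodes.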

Hadoop's Google influence doesn't end there. Much like 'Google', the name 'Hadoop' does not mean anything: Hadoop's creator, Doug Cutting, named it after his son's yellow stuffed elephant, after discarding several alternatives that were either already web domains or trademarked. While the name was chosen for its originality and ease of pronunciation, Hadoop's basic purpose is to support applications that process vast amounts of data in a time-efficient manner. Yahoo is Hadoop's biggest supporter, and the framework is used extensively in Yahoo's search engine and advertising businesses.

In January 2010, Yahoo helped the Indian Institute of Technology Bombay set up a Hadoop cluster lab by donating a cluster of servers running the open-source Hadoop software. The cluster lab in Mumbai will help researchers at the institute study areas such as searching and ranking techniques, information extraction and natural language processing. Several other companies, including Amazon, Adobe, AOL, Facebook, Google and Hulu, run Hadoop clusters that scale to thousands of computing nodes and petabytes of data.

Hadoop recently graduated to a top-level project at the Apache Software Foundation. The Hadoop Distributed File System (HDFS) is a key component of Hadoop that is designed to store data on commodity hardware with high access bandwidth across the cluster.

HDFS = Hadoop Distributed File System

Hadoop's fault-tolerant storage system, the Hadoop Distributed File System (HDFS), can store huge amounts of information, scale up incrementally, and survive the failure of significant parts of the storage infrastructure without losing data. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. With so many components, each carrying a non-trivial probability of failure, some component of HDFS is effectively always non-functional. Detection of faults and quick, automatic recovery from them is therefore a core architectural goal of HDFS.

The emphasis on high throughput rather than low latency makes HDFS appealing for batch processing. HDFS implements a simple data coherency model, write-once-read-many, that allows high-throughput data access. The portability of Hadoop eases integration with heterogeneous hardware and software platforms.
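The payoff of write-once-read-many is that readers never need coordination: once a file is closed it cannot change, so any number of readers can scan it without locks. A toy sketch of those semantics, with illustrative names rather than HDFS's actual API:

```java
import java.util.*;

// Toy sketch of write-once-read-many semantics: a file accepts appends
// until it is closed, after which it is immutable and any number of
// readers can scan it with no locking. Illustrative names, not HDFS's API.
public class WriteOnceFile {
    private final List<String> blocks = new ArrayList<>();
    private boolean closed = false;

    // Writes are only legal before the file is closed.
    void append(String block) {
        if (closed) throw new IllegalStateException("file is write-once");
        blocks.add(block);
    }

    void close() { closed = true; }

    // Reads need no synchronization: closed data never changes.
    List<String> read() { return List.copyOf(blocks); }

    public static void main(String[] args) {
        WriteOnceFile f = new WriteOnceFile();
        f.append("block-1");
        f.append("block-2");
        f.close();
        System.out.println(f.read().size()); // number of blocks written
    }
}
```

Dropping the ability to rewrite data in place is exactly the simplification that lets HDFS stream large files at full disk bandwidth instead of paying for fine-grained coherency.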

Hadoop at Great Indian Developer Summit

Hadoop is now considered the gold standard for the divide-and-conquer model of problem crunching. Matthew McCullough, an energetic 12-year veteran of enterprise software development and open source education, is coming this summer to the Great Indian Developer Summit, India's biggest polyglot conference and workshop series for software professionals, to teach how to use the well-traveled Apache Hadoop framework.

Matthew is a member of the JCP and co-founder of Ambient Ideas. His experience includes successful J2EE, SOA, and Web Service implementations for real estate, financial management, and telecommunications firms, and several published open source libraries. His focus areas include Cloud Computing, Maven, iPhone, Distributed Version Control, and OSS Tools.

At GIDS 2010, Matthew will also train IT professionals on other subjects. His session on open source web debugging tools covers the REST, HTML, SOAP, CSS, TCP, filesystem and JavaScript facets of an app, and looks at utilities such as tcpdump, curl, Wireshark, JMeter, Firebug, JASH, Poster, SoapUI, Firediff, lsof, fs_usage, iwatch and more. He will also cover iPhone application coding in Objective-C and integration with favorite Java web service back-ends such as RESTful Grails, and give a whistle-stop tour of Maven 3.0, exploring its performance improvements, features that make debugging Maven issues easier, and changes to POMs that may require modifications to your build but will result in more determinate build outputs.

Matthew will also distill hundreds of hours of research into a 180-minute cloud computing boot camp on the Google App Engine, bootstrapping participants on what cloud computing is and isn't, who the players in the space are and what unique features each offers, walking through live demos of building and deploying an app to the Google App Engine, and showcasing the excellent tooling the framework provides. Lastly, he will put a reality check on cloud computing, and GAE specifically, by looking at pitfalls and gotchas.

Every year, GIDS is a game changer for several thousand IT professionals, providing them with a competitive edge over their peers, enlightening them with bleeding-edge information most useful in their daily jobs, helping them network with world-class experts and visionaries, and giving their careers a much needed thrust. Attend Great Indian Developer Summit to gain the information, education and solutions you seek: post-conference workshops, breakout sessions by expert instructors, keynotes by industry heavyweights, enhanced networking opportunities, and more. Click here to register.
