As Big Data grows into something larger than life, Hadoop has evolved from a blunt instrument into an elegant platform capable of supporting the most advanced data-processing technologies in the world. And with over 14,000 servers running in our Hadoop warehouse, Rocket Fuel’s data infrastructure is among the most advanced and scalable anywhere.
That’s why, when engineering teams worldwide flocked to the Hadoop Summit to find out what’s new with Hadoop 2.0’s omnipresent platform, Rocket Fuel’s presentations garnered some of the most intense interest. The scale and velocity with which we run our predictive technologies present a set of unique challenges that those tasked with the Herculean job of maintaining a large-scale distributed system were eager to explore.
Below, download our presentations and find out what Rocket Fuel's Big Data experts had to share at this year's Hadoop Summit.
Title: Hado“OPS” or Had“oops”
Track: Hadoop Deployment & Operations
Focus: Highly Technical
Authors: Kishore Yellamraju and Abhijit Pol
Maintaining large-scale distributed systems is a herculean task, and Hadoop is no exception. The scale and velocity at which we operate at Rocket Fuel present a unique challenge: we saw a five-fold growth in petabytes of data and a five-fold increase in the number of machines, all within a single year. As Hadoop became critical infrastructure at Rocket Fuel, we had to ensure scale and high availability so our reporting, data mining, and machine learning could continue to excel. We also had to ensure business continuity with disaster-recovery plans in the face of this drastic growth. In this presentation, we will discuss what worked well for us and what we learned (the hard way). Specifically, we will (a) describe how we automated installation and dynamic configuration using Puppet and InfraDB, (b) describe performance tuning for scaling Hadoop, (c) talk about the good, bad, and ugly of scheduling and multi-tenancy, (d) detail some of the hard-fought issues, (e) outline our business-continuity and disaster-recovery plans, (f) touch upon how we monitor our monster Hadoop cluster, and finally (g) share our experience of YARN at scale at Rocket Fuel.
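To give a flavor of item (a), here is a minimal sketch of what dynamic, role-driven configuration can look like: per-host Hadoop properties rendered from a central role map, in the spirit of combining Puppet with an inventory service like InfraDB. The role names and property values below are illustrative assumptions, not Rocket Fuel's actual settings.

```python
# Render hdfs-site.xml content per host role from a central role map.
# Roles and tuning values here are hypothetical examples.
ROLE_PROPERTIES = {
    "datanode": {
        "dfs.datanode.handler.count": "64",
        "dfs.datanode.max.transfer.threads": "8192",
    },
    "namenode": {
        "dfs.namenode.handler.count": "256",
    },
}

def render_hdfs_site(role):
    """Return hdfs-site.xml content for a host with the given role."""
    props = ROLE_PROPERTIES[role]
    body = "\n".join(
        f"  <property>\n    <name>{k}</name>\n    <value>{v}</value>\n  </property>"
        for k, v in sorted(props.items())
    )
    return f"<configuration>\n{body}\n</configuration>"

print(render_hdfs_site("datanode"))
```

Keeping the role map in one place means a fleet-wide tuning change is a single edit that the configuration-management layer then pushes to every matching host.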
Title: How did you know this Ad will be relevant for me?!
Track: Data Science & Hadoop
Focus: Mostly Technical/Some Business
Authors: Savin Goyal and Sivasankaran Chandrasekar
Predicting the most relevant ad at any point in time for every individual is how Rocket Fuel optimizes ROI for an advertiser. One of the factors influencing this prediction is a consumer's online interactions and behavioral profile. With more than 45 billion interactions processed daily, this data runs into several petabytes in our Hadoop warehouse. Running machine-learning algorithms and artificial intelligence at this vast scale requires many practical issues to be addressed. First, behavioral patterns are short-lived, so to accurately reflect the tendencies of a consumer, we need to curate and refresh his or her profile as quickly as possible while avoiding multiple scans over the raw data and coping with issues like transient system outages. Second, we must address the difficulty of building models that utilize behavioral profiles without overwhelming our Hadoop cluster; at this scale, frequent refreshes of several models can place an undue burden on even a thousand-node cluster. In this talk, we will dive into (a) the practical challenges involved in designing a highly scalable and efficient solution for building behavioral profiles using the Hadoop framework and (b) techniques for ensuring the reliability and availability of mission-critical machine-learning pipelines.
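The "avoid multiple scans over the raw data" point above can be sketched as incremental profile refresh: rather than rebuilding every profile from the full interaction history, merge only the latest batch of events into the existing profiles. This toy example is an assumption about the general technique, not Rocket Fuel's actual pipeline; the user IDs and interest labels are made up.

```python
from collections import Counter

def update_profiles(profiles, new_events):
    """Incrementally fold a batch of events into existing profiles.

    profiles: {user_id: Counter mapping interest -> weight}
    new_events: iterable of (user_id, interest) pairs from the latest batch
    """
    for user_id, interest in new_events:
        # Unseen users get a fresh, empty profile on first contact.
        profiles.setdefault(user_id, Counter())[interest] += 1
    return profiles

# Existing state plus one new batch; no rescan of historical raw data.
profiles = {"u1": Counter({"sports": 3})}
update_profiles(profiles, [("u1", "sports"), ("u2", "travel")])
```

The same merge shape maps naturally onto a MapReduce reducer keyed by user ID, where the prior profile and the day's delta arrive as values to be combined.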