Big Data: Principles and best practices of scalable realtime data systems

Nathan Marz, James Warren

Language: English

Pages: 328

ISBN: 1617290343

Format: PDF / Kindle (mobi) / ePub


Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they're built.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Book

Web-scale applications like social networks, real-time analytics, or e-commerce sites handle data whose volume and velocity exceed the limits of traditional database systems. These applications require architectures built around clusters of machines to store and process data of any size or speed. Fortunately, scale and simplicity are not mutually exclusive.

Big Data teaches you to build big data systems using an architecture designed specifically to capture and analyze web-scale data. This book presents the Lambda Architecture, a scalable, easy-to-understand approach that can be built and run by a small team. You'll explore the theory of big data systems and how to implement them in practice. In addition to discovering a general framework for processing big data, you'll learn specific technologies like Hadoop, Storm, and NoSQL databases.

This book requires no previous exposure to large-scale data analysis or NoSQL tools. Familiarity with traditional databases is helpful.

What's Inside

  • Introduction to big data systems
  • Real-time processing of web-scale data
  • Tools like Hadoop, Cassandra, and Storm
  • Extensions to traditional database skills

About the Authors

Nathan Marz is the creator of Apache Storm and the originator of the Lambda Architecture for big data systems. James Warren is an analytics architect with a background in machine learning and scientific computing.

Table of Contents

  1. A new paradigm for Big Data
  2. Data model for Big Data
  3. Data model for Big Data: Illustration
  4. Data storage on the batch layer
  5. Data storage on the batch layer: Illustration
  6. Batch layer
  7. Batch layer: Illustration
  8. An example batch layer: Architecture and algorithms
  9. An example batch layer: Implementation
  10. Serving layer
  11. Serving layer: Illustration
  12. Realtime views
  13. Realtime views: Illustration
  14. Queuing and stream processing
  15. Queuing and stream processing: Illustration
  16. Micro-batch stream processing
  17. Micro-batch stream processing: Illustration
  18. Lambda Architecture in depth


conclusions would you draw about the impact of Google’s announcements?

Figure 2.5 A summary of one day of trading for Google, Apple, and Amazon stocks: previous close, opening, high, low, close, and net change.

Apple held steady throughout the day. Google’s stock price had a slight boost on the day of the announcement. Amazon’s stock dipped in late-day trading.

Figure 2.6 Relative stock price changes of

each group. This transformation is illustrated in figure 6.18.

Input tuples:

  sentence         word
  the dog          dog
  fly to the moon  fly
  fly to the moon  to
  fly to the moon  moon
  dog              dog

Group by: [word]

  sentence         word
  the dog          dog
  dog              dog
  fly to the moon  fly
  fly to the moon  to
  fly to the moon  moon

Aggregator: count () -> (count)

  word   count
  dog    2
  fly    1
  to     1
  moon   1

Figure 6.18 Illustration of pipe diagram group by and aggregation
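The group-by and count step of figure 6.18 can be sketched in plain Java. This is only an illustration of the idea, not JCascalog code and not the book's implementation: it uses the standard java.util.stream collectors, and the class name WordCountSketch and the method countByWord are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Hypothetical helper: groups identical words together, then applies a
    // count aggregator to each group. A LinkedHashMap keeps the groups in
    // first-seen order so the result is easy to compare with the figure.
    static Map<String, Long> countByWord(List<String> words) {
        return words.stream()
            .collect(Collectors.groupingBy(
                word -> word,            // group by: [word]
                LinkedHashMap::new,      // keep insertion order of groups
                Collectors.counting())); // aggregator: count
    }

    public static void main(String[] args) {
        // The "word" column of the input tuples in figure 6.18.
        List<String> words = List.of("dog", "fly", "to", "moon", "dog");
        System.out.println(countByWord(words));
    }
}
```

Running main prints each word with the size of its group, matching the word and count columns of the figure. In a real batch layer the grouping would be distributed across a cluster rather than performed in memory on one machine.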

composition but Pig doesn’t is entirely due to fundamental differences in how computations are expressed in JCascalog versus Pig. We’ll cover this functionality of JCascalog in depth later—the takeaway here is the importance of abstractions being composable. There are many other examples of composition that we’ll explore throughout this chapter. Now that you’ve seen some common sources of complexity in data-processing tools, let’s begin our exploration of JCascalog. 7.3 An introduction to

/tmp/swa/equivs{iteration number} for the path. These outputs will consist of 2-tuples of PersonIDs. The following code creates the initial dataset by transforming the equiv edge objects stored in the master dataset:

    public static class EdgifyEquiv extends CascalogFunction {
      public void operate(FlowProcess process, FunctionCall call) {
        Data data = (Data) call.getArguments().getObject(0);
        EquivEdge equiv = data.get_dataunit().get_equiv();
        call.getOutputCollector()
            .add(new Tuple(equiv.get_id1(),

    "?domain", "_", "?num-user-visits", "?num-user-bounces")
        .predicate(new Sum(), "?num-user-visits")
            .out("?num-visits")
        .predicate(new Sum(), "?num-user-bounces")
            .out("?num-bounces");
      return bounces;
    }

Listing annotations:

  • Sorts pageviews chronologically to analyze visits. The Option.SORT predicate allows you to control how each group is sorted before being fed to the aggregator operations.
  • Bounces and visits are determined per user.
  • Sum bounces and visits
