Webinar: Working Together at the Intersection of Data Science and Data Engineering

The scientists vs. the engineers. In the world of big data, the two often rub elbows but rarely work together as a team. Working in isolation, data scientists traditionally create models and then hand them off to data engineers to implement. This siloed paradigm lowers quality and efficiency and raises cost.

In this webinar, presented by the UC Berkeley School of Information, host Stephen O’Sullivan, VP of Engineering at Silicon Valley Data Science, explores why such an approach creates unnecessary headaches and how working as a blended team can lead to better results. Using real-world cases, he discusses how collaboration helps the two disciplines learn from one another and optimize outcomes in the process.

About the Presenter: A leading expert on big data architecture and Hadoop, Stephen O’Sullivan brings over 20 years of experience creating scalable, high-availability data and application solutions. A veteran of WalmartLabs, Sun and Yahoo!, Stephen leads data architecture and infrastructure at Silicon Valley Data Science, a boutique data science and data engineering consulting company that helps its customers take the data journey beyond technology to create business solutions.

The following is an overview of what O’Sullivan sees as a very effective method for creating great outcomes—with data scientists and data engineers working together as a team.

The traditional approach

O’Sullivan says the traditional siloed approach of tackling a data project usually goes something like this:

  1. The business has a question/problem that needs to be solved.
  2. The data scientists go away “into their corner” to work on the question/problem.
  3. They have an answer!
  4. The model (algorithm) is “thrown over the wall” to the data engineers. Although this is the first time they see it, they’re expected to deploy it into production.
  5. Then the data engineering team has to scale it.
  6. However, it doesn’t scale.
  7. Now what? The data engineering team has to unpack the model, see what doesn’t work, and figure out if and how it can be distributed—creating a ping-pong dynamic between the two teams.

Why the traditional approach doesn’t work

He notes the following dynamics that prevent the siloed approach from being effective:

  • The data scientist is focused on the problem (rightly so), but in a vacuum.
  • The data scientist lacks an understanding of the infrastructure/platform that the models must be executed in.
  • The data engineer sees the model at the end, but not while it’s being designed/created.

How to work as a team

O’Sullivan says that if the teams are blended, they can produce better results—using this approach to make it happen:

  • Create a “2 pizza” team of data scientists and data engineers, and get them to work really well together. What’s the definition of a “2 pizza” team? You guessed it: two pizzas will feed them all, which means the size of the team matters.
  • Look at the problem/question together, making use of brainstorming and white-boarding ideas. This provides the opportunity for input from two different perspectives:

    The data scientist will want to know:

      • How do I get the data?
      • How do I look at the algorithm?
      • What do I need to do to find the problem related to that question?

    The data engineer will want to know:

      • What data sets do we need to pull in?
      • Are they going to be real-time?
      • Do I need to complete some streaming pipeline?
      • What do I have already there?
      • What does my infrastructure look like and will this work?

  • Look at the solutions (models) as a team, and see if they fit the current infrastructure/platform. This is where any deployment issues can be evaluated and resolved. If something needs to be changed, it can be addressed before getting to production.
  • Test the model in “The Wild.” Do this as a team in some type of testing or integration environment.
  • Deploy it into production together with the customer, and make it “Drama-free.”

Why blend the teams?

O’Sullivan says there are many benefits to creating a blended team:

  • Provides the ability to either solve the problem faster or fail faster—which leads to more timely and efficient results.
  • Creates better teamwork. Both parties have “skin in the game” and want to be successful.
  • Provides the ability to build a data platform with data science in mind, instead of having the engineers build a platform that the data scientists can’t use.
  • Provides the opportunity to learn from each other. By cross-pollinating skills, both disciplines benefit by increasing understanding and efficiency.
  • Stops finger-pointing. If the outcome isn’t successful, it’s the whole team’s fault.

From Idea to Production

O’Sullivan says that when his team starts a project, they identify the business goals, distill those into use cases, and then work in iterative cycles to achieve tangible gains. When defining the success of a project, they include the following fundamental components in the evaluation:

  • Incremental value
  • Time to market
  • Economically viable implementation
  • Cost avoidance
  • Brand benefit
  • Goodwill

The Experimental Enterprise

To get to data science in the experimental enterprise, which allows us to observe our experiments and respond to the changing environment, O’Sullivan says that building the appropriate foundation is essential. This means first making infrastructure readily accessible (Cloud, DevOps, Open Source), then both supporting investigative work and building a solid layer for production (Agile, platforms and APIs).

Agile Data Science Basics

O’Sullivan says that although the following principles are the same as Agile in engineering, it’s not always easy for data scientists to adjust to using the same method:

  • The Project Charter identifies the desired end point, and expected timeline for getting there.
  • There is a plan for the overall project which charts the investigation themes that will be focused on over the expected timeline.
  • The project is organized and run in sprints, typically two-week increments of work.
  • Work is organized into stories—specific tasks necessary for reaching the goal which can be reasonably expected to be completed in the sprint.
  • Each sprint has a regular cadence of coordination and feedback meetings:

      • Kickoff: populate the backlog with the stories selected for the current sprint and assign responsibility for each story.
      • Standups: daily, BRIEF coordination meetings that include everyone.
      • Retrospective: the time to show what’s been accomplished and steer the project based on lessons learned, “product” feedback, etc.

Use Cases

O’Sullivan provides the following specific use cases, and explains how his company achieved project goals:

TV Advertising Platform

Goal: Build a data platform to serve the company’s product needs, and take their data science aspirations to the next level.

O’Sullivan says the customer wanted to do analytics on a variety of data points per subscriber. The project team therefore included a data scientist, to make sure the right questions were being asked, and a data engineer, to build a platform that would meet that data science need. By working together and building the platform with data science in mind, they finished early and saved the customer money.

Major Global Brand

Goal: Sentiment analysis on product comments in different languages.

O’Sullivan says that this started out as an easy data science project, but then the company said it wasn’t getting the data back fast enough from IT. The team therefore had to change the model it was using for streaming the data in, a change that would typically have taken about eight weeks. Because the data science and data engineering teams were working together, it took only about three weeks.

Gaming Company

Goal: Build a real-time ingest pipeline to perform predictive analytics on game events and serve targeted ads to customers.

O’Sullivan says that for this project, they worked with the customer’s data scientists to determine the exact timing needed for the targeted ads to appear.

By working together, data scientists and data engineers can break down silos to improve quality, increase efficiency and decrease costs.

For a first-hand look at how these two disciplines can team up for better outcomes, check out the full webinar.

Citation for this content: datascience@berkeley, the online Master of Information and Data Science from UC Berkeley