Webinar: Working Together at the Intersection of Data Science and Data Engineering
The traditional approach
- The business has a question/problem that needs to be solved.
- The data scientists go away “into their corner” to work on the question/problem.
- They have an answer!
- The model (algorithm) is “thrown over the wall” to the data engineers. Although this is the first time the engineers have seen it, they’re expected to deploy it into production.
- Then the data engineering team has to scale it.
- However, it doesn’t scale.
- Now what? The data engineering team has to unpack the model, see what doesn’t work, and figure out if and how it can be distributed—creating a ping-pong dynamic between the two teams.
Why the traditional approach doesn’t work
- The data scientist is focused on the problem (rightly so), but in a vacuum.
- The data scientist lacks an understanding of the infrastructure/platform on which the models must run.
- The data engineer sees the model at the end, but not while it’s being designed/created.
How to work as a team
- Create a “2 pizza” team of data scientists and data engineers, and get them to work really well together. What’s the definition of a “2 pizza” team? You guessed it: a team small enough that two pizzas will feed everyone, which means the size of the team matters.
- Look at the problem/question together, brainstorming and whiteboarding ideas. This provides the opportunity for input from two different perspectives, the data scientist’s and the data engineer’s:
  - How do I get the data?
  - How do I look at the algorithm?
  - What do I need to do to find the problem related to that question?
  - What data sets do we need to pull in?
  - Are they going to be real-time?
  - Do I need to complete some streaming pipeline?
  - What do I have already there?
  - What does my infrastructure look like and will this work?
- Look at the solutions (models) as a team, and see if they fit the current infrastructure/platform. This is where any deployment issues can be evaluated and resolved. If something needs to be changed, it can be addressed before getting to production.
- Test the model in “The Wild.” Do this as a team in some type of testing or integration environment.
- Deploy it into production together with the customer, and make it “Drama-free.”
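One way to make the "does it fit the platform" check and the integration test in the steps above concrete is a small contract test that both disciplines own. The sketch below is illustrative only: `DeploymentContract`, `check_model`, `toy_predict`, and the limits used are hypothetical names and values chosen for this example, not something presented in the webinar.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class DeploymentContract:
    """Hypothetical limits the data engineers publish for the platform."""
    max_batch_size: int           # largest batch the pipeline will send
    max_seconds_per_batch: float  # latency budget for scoring one batch


def check_model(
    predict: Callable[[Sequence[float]], List[float]],
    contract: DeploymentContract,
) -> List[str]:
    """Return a list of contract violations (an empty list means the model fits)."""
    problems = []
    # Build the largest batch the platform is expected to send.
    batch = [float(i) for i in range(contract.max_batch_size)]

    start = time.perf_counter()
    outputs = predict(batch)
    elapsed = time.perf_counter() - start

    if len(outputs) != len(batch):
        problems.append("expected one prediction per input row")
    if elapsed > contract.max_seconds_per_batch:
        problems.append(f"batch took {elapsed:.3f}s, over the latency budget")
    return problems


def toy_predict(rows: Sequence[float]) -> List[float]:
    """Stand-in for the data scientists' model (assumption for this sketch)."""
    return [2.0 * x + 1.0 for x in rows]


contract = DeploymentContract(max_batch_size=1000, max_seconds_per_batch=1.0)
violations = check_model(toy_predict, contract)
print(violations or "model fits the contract")
```

Running a check like this in the shared testing or integration environment lets the team catch scaling problems while the model is still being designed, rather than after it has been handed over.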
Why blend the teams?
- Provides the ability to either solve the problem faster or fail faster—which leads to more timely and efficient results.
- Creates better teamwork. Both parties have “skin in the game” and want to be successful.
- Provides the ability to build a data platform with data science in mind – instead of having the engineers build a platform that the data scientists can’t use.
- Provides the opportunity to learn from each other. By cross-pollinating skills, both disciplines benefit by increasing understanding and efficiency.
- Stops finger-pointing. If the outcome isn’t successful, it’s the whole team’s fault.
From Idea to Production
- Incremental value
- Time to market
- Economically viable implementation
- Cost avoidance
- Brand benefit
The Experimental Enterprise
Agile Data Science Basics
- The Project Charter identifies the desired end point and the expected timeline for getting there.
- An overall project plan charts the investigation themes to focus on over the expected timeline.
- The project is organized and run in sprints: typically two-week increments of work.
- Work is organized into stories: specific tasks necessary for reaching the goal that can reasonably be expected to be completed within the sprint.
- Each sprint has a regular cadence of coordination and feedback meetings:
  - Kickoff: populate the backlog with the stories selected for the current sprint and assign responsibility for each story.
  - Standups: daily, BRIEF coordination meetings that include everyone.
  - Retrospective: the time to show what’s been accomplished and steer the project based on lessons learned, “product” feedback, etc.