
The rise of Data Science


The shape of things to come

In the past few months, I think we’ve witnessed the end of the “Big Data” hype,
and the rise of “Data Science”.

To me, this is the natural transition from buzzword to useful technology.

I’m really thrilled to see enthusiastic people challenge themselves on Kaggle,
organizations fund the Jupyter project, and platforms like plot.ly get wider
attention.

Today, I see more and more analysts turn their Excel workbooks into Jupyter
notebooks, share data and insight within their company, for everyone to see and
act upon.

To me, we’re finally achieving the years-old dream of sharing data across
organizations, even from teams that are seemingly unrelated.
We’re on the verge of synergy.

But this dream is still fragile, and the biggest threat to all dreams is the
same: disappointment.

A fragile dream

To be used, a data analysis service should of course provide insightful data,
but also be dependable, and that’s where the problems start.

Software design and deployment is a craft, as much as data science is, and few
people excel at both.

Most people in the field, having to deal with complicated software stacks and
spending their day at the command line, will be happy to set up their own
solution to expose their latest machine learning service.

And that’s perfectly fine and normal, because that’s what makes most sense in
today’s world of tight R&D budgets, and because data analysts don’t usually come
from a CS curriculum.

Self-inflicted wounds

But soon you end up with as many custom-built servers as you have teams within
the organization, each implementing their own security model, authentication
backends, mail notification, paging and load-balancing.

Add to the mix the high turnover in the industry, GitHub-stars-based platform
selection, unpinned dependencies, … and good luck keeping the services up and
running a few months from now.
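To make the "unpinned dependencies" point concrete, here is a minimal sketch of a check that flags requirement lines without an exact version. It assumes a pip-style requirements list where a pin is written `name==version`; the function name `unpinned_requirements` and the sample entries are hypothetical, and the parsing is deliberately simplistic (it does not handle URLs, environment markers or hashes).

```python
import re

# A line counts as pinned when it looks like "name==version",
# optionally with extras such as "pkg[extra]==1.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    result = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            result.append(line)
    return result


# Hypothetical requirements, for illustration only.
example = [
    "numpy==1.26.4",
    "pandas>=2.0",
    "requests",
    "# a comment",
    "scikit-learn==1.4.2  # pinned",
]
print(unpinned_requirements(example))  # ['pandas>=2.0', 'requests']
```

A check like this could run in continuous integration, so that a service does not silently pick up a new library version the next time it is deployed.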

Of course, in a corporate environment, IT services provide infrastructure and
often help with the deployment of business-produced applications. But the
production of said applications is either:

  • fully delegated to the analysts (who usually lack experience in developing
    reliable applications), or
  • the responsibility of the IT department, which has to grasp the essence of
    the analysts’ work (an equally daunting task).

To me, having so many business people writing their own services, automating
their work, producing sophisticated machine learning algorithms, is an exciting
opportunity for organizations.

But if they fail to deliver what can be expected from a professional service,
it might quickly go to waste, to everyone’s loss.

Bridging the gap between data science and IT

Basically, if you produce code for a living, then you should (learn to) code
properly.

If you are assembling a team of data scientists, you might consider this setup:

Infrastructure team

  • pure IT, you want sysadmins and networking people here
  • provide the infrastructure, connectivity, run centralized services such as:
    • LDAP
    • monitoring
    • continuous integration/deployment
    • backups
    • firewall
    • version control
  • provide a standardized way to deploy web services

Data Science team

  • this should be your main team in terms of headcount
  • play with models and data
  • write services
  • invest some time in basic computer science training:
    • version control
    • algorithms and programming best practices
    • following a style guide (to balance bus factor and high turnover)
    • test driven development and writing good APIs
    • networking notions
    • performance tuning rules of thumb
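As an illustration of the "test driven development and writing good APIs" item, here is a minimal sketch: a small function with a documented contract, plus the tests that would have been written first. The function `moving_average` and its behaviour on edge cases are assumptions chosen for the example, not something prescribed by this article. The `test_*` functions use plain assertions, so a test runner such as pytest could collect them.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` points.

    Returns an empty list when the window is longer than the input, and
    raises ValueError when the window is not a positive integer.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if window > len(values):
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def test_basic_window():
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]


def test_window_longer_than_input():
    assert moving_average([1, 2], 3) == []


def test_invalid_window_is_rejected():
    try:
        moving_average([1, 2, 3], 0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a non-positive window")
```

Writing the tests first forces a decision on the edge cases (empty result or error?) before anyone else depends on the service.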

Facilitators

  • if your team reaches 5-8 people, you might want to add facilitators to solve
    day-to-day technical issues
  • preferably developers with good experience in data science
  • review the code produced by the data science team daily to ensure
    compliance with the deployment requirements
  • translate and follow up on the data science team’s infrastructure
    requests, such as:
    • adding a website to the firewall whitelist so it can be scraped
    • providing servers with GPUs
    • restoring backups

You should be able to compose such a team with people you already work with,
and quickly start shipping valuable services for your organization.

Note: Adimian provides training for non-developers writing code. Get in touch to learn more!