(Last updated: 05/16/2019)
This year’s Gartner report on data science and machine learning platforms was a disappointing read. It turned me off right off the bat with the notion of “citizen data scientists”. Later, when my eye met the term “maverick data scientists”, I told myself: Gartner, you really jumped the shark here! The liberal use of made-up terms made me question Gartner’s grasp of the analytics ecosystem and of the character traits of various analytical personas: their skill sets, the activities they carry out day in and day out, and the way they interact with analytics software.
Three Dimensions of Data Science
Traditional sciences have been characterized by the empirical research method, which involves careful collection of evidence in a domain, followed by formulating and testing hypotheses that subject (small) data to rigorous mathematical theories. Nowadays, when almost all domains are inundated with ambient data that were not necessarily collected per an a priori study design, data itself becomes the subject of scientific study: does the data contain any evidence? If yes, how do we refine data into features and further derive knowledge from those features? Since data arrives and accumulates so fast, its sheer velocity and volume make it impossible to handle in batch on a single computer. Knowledge of computer science (CS) is indispensable for devising parallel algorithms and distributed storage schemes that process big data efficiently. In a word, data science is the intersection of math, computer science and domain knowledge.
True data scientists are then supposed to be versed in all three dimensions of data science, which is a tall order indeed. While the skill sets of data scientists may span the full spectrum of the Math-CS-Domain cubical space, analytics vendors should focus on analytical personas that are experts in at least one of the dimensions.
Three Types of Analytical Activities
While customers’ analytical needs are continuous and uneven across departments, what analytics vendors can offer is discrete. To close the gaps between an enterprise’s business needs and its analytics platform of choice, three distinct types of analytical activities are carried out during the firm’s day-to-day operations:
- Customization. For business operations with well-defined concepts and workflows, customers buy off-the-shelf solutions and customize them, within the confines of the solution, to fit their specific needs. For example, a forecasting solution provides automatic sales predictions based on historical data, but a customer may want to override them with specific numbers or logic (IF weekends, prediction * 1.2).
- Application Development. For related tasks that are carried out repeatedly by multiple people across the firm, it makes sense to abstract the common features of the tasks into an analytical app. Most analytical apps are for end-user consumption, such as a self-service business report app where a user may produce a sales report for a specific rep and region. When an analytical app is geared toward internal data scientists and analysts, it’s called a tool. For example, the statisticians in a CRO may use a statistical power analysis app to decide on an adequate sample size for their study.
- Ad Hoc Analysis. For tasks that are one-off, too small or too unique, all bets are off and data scientists/analysts must start from scratch, at the foundation level of the analytics platform. For example, a VP at the UNC Hospital System needs to know whether and when to build a hospital in west Cary. Demographic, business and economic data need to be collected, cleaned and analyzed. A final report needs to be submitted, at which point the task is done.
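The customization example in the first bullet can be sketched in a few lines of Python. This is only an illustration: the function name `apply_weekend_override`, the sample forecast values, and the reading of “weekends” as Saturday and Sunday are all assumptions, not the behavior of any particular forecasting product.

```python
from datetime import date
from typing import Dict

WEEKEND_UPLIFT = 1.2  # multiplier from the "IF weekends, prediction * 1.2" rule


def apply_weekend_override(forecast: Dict[date, float]) -> Dict[date, float]:
    """Return a copy of the forecast with weekend predictions scaled up."""
    adjusted = {}
    for day, prediction in forecast.items():
        # date.weekday() returns 5 for Saturday and 6 for Sunday
        is_weekend = day.weekday() >= 5
        adjusted[day] = prediction * WEEKEND_UPLIFT if is_weekend else prediction
    return adjusted


# Hypothetical output of an off-the-shelf forecasting solution
baseline = {
    date(2019, 5, 17): 100.0,  # Friday
    date(2019, 5, 18): 100.0,  # Saturday
    date(2019, 5, 19): 50.0,   # Sunday
}
customized = apply_weekend_override(baseline)
```

The vendor’s forecast is left untouched; the customer’s rule is layered on top of it, which is what keeps customization within the confines of the solution.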
Four Modes of Human Analytics Interaction
With 50 years of Human-Computer Interaction research to draw upon, human-analytics interaction has evolved from imperative to declarative, and from manipulative to demonstrative.
Imperative
In imperative mode, you tell the analytical software system how to get what you want, step by step. You must worry about both the logic and the control flow of your computation.
For example, to choose the odd numbers among integers less than 10:
List<int> intsLtTen = new List<int> {0,1,2,3,4,5,6,7,8,9};
We’d like the analytics software to carry out the steps below:
- Create a container for the result set
- Step through each number in the input set
- Check the number; if it’s odd, add it to the result set
And in code form:
List<int> oddInts = new List<int>();
foreach (var num in intsLtTen)
{
    if (num % 2 != 0) oddInts.Add(num);
}
Declarative
In declarative mode, you describe to the analytical software system what you want in the end, not how to get there. You only worry about the logic of your computation in declarative mode.
To choose the odd numbers among integers less than 10 in declarative fashion:
var oddInts = intsLtTen.Where( num => num % 2 != 0);
As another example, in user interface construction, a declarative way to build a sales report form with two combo boxes and one button is to list the widgets and their properties and let the framework create them.
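A minimal sketch of such a declarative description, in Python, might look like the following. Everything here is hypothetical: the `SALES_REPORT_FORM` structure, the widget names (rep and region are borrowed from the report app example earlier), and the toy `render` function that stands in for a real UI framework.

```python
# Hypothetical declarative description of the sales report form:
# the spec states WHAT the form contains, not HOW to construct it.
SALES_REPORT_FORM = {
    "title": "Sales Report",
    "widgets": [
        {"type": "combobox", "name": "rep", "label": "Sales rep"},
        {"type": "combobox", "name": "region", "label": "Region"},
        {"type": "button", "name": "run", "label": "Run report"},
    ],
}


def render(form: dict) -> str:
    """Toy renderer standing in for a UI framework: builds a text outline."""
    lines = [form["title"]]
    for widget in form["widgets"]:
        lines.append(f"  [{widget['type']}] {widget['label']}")
    return "\n".join(lines)


outline = render(SALES_REPORT_FORM)
```

The author of the form only edits the data structure; the loop that walks the widgets lives in the framework, which is the control flow the declarative mode hides.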
Manipulative
Imperative and declarative modes are mostly textual, in the sense that you work in a text editor churning out code, XML or JSON files. The cost of being textual is syntax: wrong indentation levels in Python, a missing closing tag in XML…. Syntax can be an insurmountable barrier for most minds that are not nerdy or geeky enough.
Enter direct manipulation, “an interaction style in which the objects of interest in the UI are visible and can be acted upon via physical, reversible, incremental actions that receive immediate feedback.” In the manipulative mode of human-analytics interaction, domain objects are directly manipulable, without having to descend into the quagmire of syntax. Manipulative mode can be layered upon imperative or declarative mode.
For example, the sales report form can be built by dragging and dropping UI widgets from a widget library and then configuring the properties of each widget.
For an example of building an imperative analytics piece by direct manipulation, below is a reincarnation of the imperative example program built with Lego blocks from Google’s Blockly:
Demonstrative
The downside of manipulative mode is the loss of generalization and abstraction. If the same task needs to be performed on a large set of similar objects, you must select one object, perform the task, select another object, perform the same task, and so on. One wishes for a way to tell the analytics software in a single command: perform the task on EACH of the objects!
Enter programming by demonstration, where “end-user demonstrate actions on concrete examples and the system records user actions and infers a generalized program that can be used on new examples”.
Demonstration goes both ways. In one direction, the end user demonstrates to the analytical software system what to do, and the system records the actions. This alone is already useful, as the recorded actions are normally in the form of imperative or declarative programs, and users may learn the syntax and modify the recorded program.
In the other direction, based on the user’s action, the analytical software system may infer multiple options, demonstrate them to the user, and ask which one to act on. For example, after the user deletes one empty line in a file, the system generalizes the action to the option of “deleting all empty lines”, which may be exactly what the user wanted to do.
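The empty-line example can be sketched as a toy program in Python. This is a deliberately simplified illustration, not how Wrangler or any production system works: the function `infer_generalization` and its single hard-coded rule are assumptions made for this sketch.

```python
from typing import Callable, List, Optional, Tuple

Lines = List[str]


def infer_generalization(
    lines: Lines, deleted_index: int
) -> Optional[Tuple[str, Callable[[Lines], Lines]]]:
    """Given one demonstrated deletion, propose a generalized program.

    Only a single hard-coded rule is tried here: if the deleted line was
    empty (or whitespace only), suggest deleting every such line.
    """
    if lines[deleted_index].strip() == "":
        def delete_all_empty(ls: Lines) -> Lines:
            return [line for line in ls if line.strip() != ""]
        return "delete all empty lines", delete_all_empty
    return None  # no generalization found; keep only the literal action


document = ["id,amount", "", "1,20", "", "2,35"]

# The user demonstrates once: delete the empty line at index 1.
suggestion = infer_generalization(document, deleted_index=1)

if suggestion is not None:
    description, program = suggestion
    # A real system would ask the user to confirm; here we just apply it.
    cleaned = program(document)
```

The returned `program` is an ordinary function, so it can be applied to new files as well, which mirrors the idea that the inferred program should work on new examples.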
Putting it together
Now that we have gone through the three dimensions of data science (Math, CS, Domain knowledge), the three types of analytical activities (Customization, Application development, Ad hoc analysis) and the four modes of human-analytics interaction (Imperative, Declarative, Manipulative, Demonstrative), it’s time to loop back to the question that motivated this article: what are good names for the different analytical personas? Well, I don’t know. What I do know is that the analytical talents in a typical enterprise span the full spectrum of the grid below. Even as an individual, the best data scientist tends to be a jack of all trades, working across multiple types of analytical activities and switching to the most efficient mode of interaction as needed. A true analytics platform should support analytical users where they are, given their skill sets, the analytical activity they’re in and the mode of interaction they’re comfortable with. For example, forcing a Python enthusiast to program in the SAS DATA step is not a winning strategy, because they can walk to the next shop that doesn’t use SAS.
References:
- Direct manipulation: A step beyond programming languages. The paper that gave us the digital world of today.
- Wrangler: Interactive visual specification of data transformation scripts. A Trifacta paper.
- Programming by examples, and its applications in data wrangling. A Microsoft paper.
- The origins of data science is an interesting read.
- Statistical Modeling: the two cultures