Being a “Data Scientist” Is As Much About IT As It Is Analysis by Carla Gentry, aka @Data_nerd
IBM defines the data scientist as -> A data scientist represents an evolution from the business or data analyst role.
The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.
Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization. The data scientist role has been described as “part analyst, part artist.”
Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”…
A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.
Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.
IBM hits the nail on the head with the above definition. Having worked with traditional data analysts as well as programmers, developers, architects, scrum masters, and data scientists — I can tell you they don’t all think alike. A data scientist could be a statistician but a statistician may not be completely ready to take on the role of data scientist, and the same goes for all the above titles as well.
Beth Schultz from All Analytics mentioned that we are like jacks of all trades but masters of none; I don’t completely agree with this comment, but do agree that my ETL skills are not as honed as my analysis skills, for example. My definition of the data scientist includes: knowledge of large databases and clones, slave, master, nodes, schemas, agile, scrum, data cleansing, ETL, SQL and other programming languages, presentation skills, Business Intelligence and Business Optimization — plus the ability to glean actionable insight from data. I could go on and on about what the data scientists needs to be familiar with, but the analysis part has to be mastered knowledge and not just general knowledge. If you want to separate the pretenders from the experienced in this business, ask a few questions about how data science actually works!
When I start working with a new data set (it doesn’t matter how much or what kind), the first question I usually ask is, what kind of servers do you own?
Why would you need to know about the servers to work with data? I ask this question so I will know what kind of load it can handle – is it going to take me 9 hours to process or 15 minutes? How many servers do you have? I ask this because if I have 4 or 5 servers, I can toggle or load balance versus having only 1 that I have to babysit.
What kind of environment will I be working in? I ask this because I need to know if they have a test environment versus a live environment, so I can play without crashing every server in the house and ticking a lot of people off. If you are working with lots of data, lower peak times or low load times are better for live, as compared to test or staging environments where you can “play” without fear. This way, you won’t “bring down the house”.
It’s a good idea for you Chief Marketing Officers (CMOs) to let your Data Scientist work in the evening hours and/or on weekends, at their homes if applicable. This, of course, requires setting up a VPN connection and it also depends on how secure the data connections are, as well as how much processing I can do before I crash them, – um, I mean, what is the speed and capacity to process? If a dial-up connection is all that’s available, forget it.
As a side note, I’ve crashed many a server in my day – how do you think I learned all this stuff? Back in the Nineties, someone would crash the mainframe at RJKA and we would all head to Einstein’s Deli in Oak Park, IL but today, this might be frowned upon. But I digress, back to more IT related things.
Another handy thing to find out is how the databases are joined. By that I mean, what variables do they have in common (i.e., “primary keys”)? Are the relationships one-to-one, one-to-many, or many-to-many? Why would you ask this? Some programmers (I don’t mean this in general) don’t completely understand relational databases, especially when it comes to transactional data and data that needs to be refreshed often. You have to set up a database like you would play chess: think at least three moves ahead.
Additionally, some programmers/developers use too many JOIN statements in their scripts, which cause large amounts of iterations. Since these tend to increase run time and are not very efficient, you don’t want to be linking too many of these babies together and then running complex algorithms or scripts.
Sometimes, it’s better to start from scratch and build your own data source. When writing scripts to extract or refresh data, don’t forget a few keys things: normalize, index, pick your design based on what you know about the data and what is being requested of it.
Servers are important, and if dealing with large databases, load balance or toggle whenever possible. Also, star schema versus snowflake schema is important, so please put some serious thought into this. Ask yourself, do I need it fast or efficient? Believe me, I always pick efficient (I am a nerd, after all) but if the client needs it ASAP, then the client shall have it ASAP.
With knowledge of the client’s IT setup from a data management/quality perspective, you’ll be equipped to handle most situations you run into when dealing with data, even if the Architect and Programmer are out sick. Your professional knowledge is going to be a big help in getting the assignment or job complete.
Happy data mining and please play with data responsibly!
About the Author