A Data Nerd's view on AI, Machine learning and the Internet of Things
Being a “Data Scientist” Is As Much About IT As It Is Analysis by Carla Gentry, aka @Data_nerd
IBM defines the data scientist as follows: A data scientist represents an evolution from the business or data analyst role. Data scientists of today don’t just crunch numbers; they view the universe as one large data set and work to decipher relationships in the data.
The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.
Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization. The data scientist role has been described as “part analyst, part artist.”
Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”…
A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.
Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.
IBM hits the nail on the head with the above definition. Having worked with traditional data analysts as well as programmers, developers, architects, scrum masters, and data scientists, I can tell you they don’t all think alike. A data scientist could be a statistician, but a statistician may not be completely ready to take on the role of data scientist, and the same goes for all the above titles as well.
Beth Schultz from All Analytics mentioned that we are like jacks of all trades but masters of none; I don’t completely agree with this comment, but I do agree that my ETL skills are not as honed as my analysis skills, for example. My definition of the data scientist includes: knowledge of large databases and clones, slave, master, nodes, schemas, agile, scrum, data cleansing, ETL, SQL and other programming languages, presentation skills, Business Intelligence and Business Optimization, plus the ability to glean actionable insight from data. I could go on and on about what a data scientist needs to be familiar with, but the analysis part has to be mastered knowledge and not just general knowledge. If you want to separate the pretenders from the experienced in this business, ask a few questions about how data science actually works!
When I start working with a new data set (it doesn’t matter how much or what kind), the first question I usually ask is, what kind of servers do you own?
Why would you need to know about the servers to work with data? I ask this question so I will know what kind of load they can handle: is it going to take me 9 hours to process or 15 minutes? How many servers do you have? I ask this because if I have 4 or 5 servers, I can toggle or load balance versus having only 1 that I have to babysit.
What kind of environment will I be working in? I ask this because I need to know if they have a test environment versus a live environment, so I can play without crashing every server in the house and ticking a lot of people off. If you are working with lots of data, lower peak times or low load times are better for live, as compared to test or staging environments where you can “play” without fear. This way, you won’t “bring down the house”.
It’s a good idea for you Chief Marketing Officers (CMOs) to let your Data Scientist work in the evening hours and/or on weekends, at their homes if applicable. This, of course, requires setting up a VPN connection, and it also depends on how secure the data connections are, as well as how much processing I can do before I crash them (um, I mean, what the speed and processing capacity are). If a dial-up connection is all that’s available, forget it.
As a side note, I’ve crashed many a server in my day. How do you think I learned all this stuff? Back in the Nineties, someone would crash the mainframe at RJKA and we would all head to Einstein’s Deli in Oak Park, IL, but today this might be frowned upon. But I digress; back to more IT-related things.
Another handy thing to find out is how the databases are joined. By that I mean, what variables do they have in common (i.e., the keys: “primary keys” and the foreign keys that reference them)? Are the relationships one-to-one, one-to-many, or many-to-many? Why would you ask this? Some programmers (I don’t mean this in general) don’t completely understand relational databases, especially when it comes to transactional data and data that needs to be refreshed often. You have to set up a database like you would play chess: think at least three moves ahead.
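To make the relationship question concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The tables (customers, orders), the column customer_id, and the helper key_cardinality are all hypothetical names made up for illustration; the idea is simply to count how many rows share each key value on both sides of the join.

```python
import sqlite3


def key_cardinality(conn, left, right, key):
    """Classify how two tables relate on a shared key column.

    Table and column names are inserted into the SQL text directly,
    so only pass names you trust (this is an illustration, not a tool).
    """
    def max_rows_per_key(table):
        # Largest number of rows that share a single key value in this table.
        sql = (
            f"SELECT MAX(n) FROM "
            f"(SELECT COUNT(*) AS n FROM {table} GROUP BY {key})"
        )
        return conn.execute(sql).fetchone()[0] or 0

    left_side = "one" if max_rows_per_key(left) <= 1 else "many"
    right_side = "one" if max_rows_per_key(right) <= 1 else "many"
    return f"{left_side}-to-{right_side}"


# Hypothetical sample data: each customer can have several orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

print(key_cardinality(conn, "customers", "orders", "customer_id"))
# prints "one-to-many"
```

If a check like this reports many-to-many where you expected one-to-many, a join on that key will multiply rows, which is worth knowing before you run anything heavy.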
Additionally, some programmers/developers use too many JOIN statements in their scripts, which causes a large number of iterations. Since these tend to increase run time and are not very efficient, you don’t want to be linking too many of these babies together and then running complex algorithms or scripts.
Sometimes, it’s better to start from scratch and build your own data source. When writing scripts to extract or refresh data, don’t forget a few key things: normalize, index, and pick your design based on what you know about the data and what is being requested of it.
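As a rough illustration of why indexing matters, the sketch below asks SQLite how it plans to run the same lookup before and after an index is created. The orders table, its columns, the index name idx_orders_customer, and the helper query_plan are hypothetical; the exact wording of the plan text varies between SQLite versions, so treat the output as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER, amount REAL)"
)
# Hypothetical data: 1,000 orders spread across 100 customers.
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT amount FROM orders WHERE customer_id = ?"

plan_before = query_plan(conn, lookup, (42,))   # expected: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, lookup, (42,))    # expected: a search using the index

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index the database has to read every row to answer the lookup; with it, the plan should mention idx_orders_customer. On a table this small the difference is invisible, but on a large transactional table it is often the difference between minutes and hours.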
Servers are important, and if dealing with large databases, load balance or toggle whenever possible. Also, star schema versus snowflake schema is important, so please put some serious thought into this. Ask yourself, do I need it fast or efficient? Believe me, I always pick efficient (I am a nerd, after all) but if the client needs it ASAP, then the client shall have it ASAP.
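For readers who have not met the two schema styles, here is a small sketch of the trade-off, again with Python's sqlite3 module. All table and column names are invented for illustration. In the star layout the product dimension carries the category name directly (some repetition, one join); in the snowflake layout the category is normalized into its own table (less repetition, one more join).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fact table shared by both layouts (hypothetical sales data).
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 5.0);

    -- Star: one denormalized dimension; the category name repeats per product.
    CREATE TABLE dim_product_star (
        product_id INTEGER PRIMARY KEY, product TEXT, category TEXT);
    INSERT INTO dim_product_star VALUES
        (1, 'Cola', 'Drinks'), (2, 'Juice', 'Drinks'), (3, 'Chips', 'Snacks');

    -- Snowflake: the category is split out into its own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product_snow (
        product_id INTEGER PRIMARY KEY, product TEXT, category_id INTEGER);
    INSERT INTO dim_category VALUES (1, 'Drinks'), (2, 'Snacks');
    INSERT INTO dim_product_snow VALUES
        (1, 'Cola', 1), (2, 'Juice', 1), (3, 'Chips', 2);
""")

# Star schema: a single join answers "sales by category".
star_sql = """
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_star d ON d.product_id = f.product_id
    GROUP BY d.category ORDER BY d.category
"""

# Snowflake schema: the same question needs a second join.
snow_sql = """
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product_snow p ON p.product_id = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
"""

star_totals = conn.execute(star_sql).fetchall()
snow_totals = conn.execute(snow_sql).fetchall()
print(star_totals)  # [('Drinks', 30.0), ('Snacks', 5.0)]
print(snow_totals)  # same totals, reached through one extra join
```

Both layouts give the same answer. The choice is about where you pay: the star layout keeps queries short at the cost of repeated values, while the snowflake layout stores each category once at the cost of longer join chains.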
With knowledge of the client’s IT setup from a data management/quality perspective, you’ll be equipped to handle most situations you run into when dealing with data, even if the Architect and Programmer are out sick. Your professional knowledge is going to be a big help in getting the assignment or job complete.
Happy data mining and please play with data responsibly!
About the Author
My client list is private, but during the past 20+ years I have worked with Fortune 100 and 500 companies including, but not limited to, Discover Financial Services, J&J, Hershey, Kraft, Kellogg’s, SCJ, McNeil, Firestone, PBA, Disney, Deloitte, Talent Analytics, Samtec and more.
Acting as a liaison between the IT department and the Executive staff, I am able to take huge complicated databases, decipher business needs and come back with intelligence that quantifies spending, profit and trends. Being called a data nerd is a badge of courage for this curious Mathematician/Economist because knowledge is power and companies are now acknowledging its importance. To find out more about what I do, please visit my profile on LinkedIn.
“Big Data needs Data Science but Data Science doesn’t need Big Data” Carla Gentry aka @data_nerd
Data science has been around for decades, and it’s not just big data. I hear a lot of people clumping these two together like they go hand-in-hand, which I agree with to an extent. However, big data needs data science but data science doesn’t necessarily need big data. Most of the data a typical company handles on a daily basis or houses internally is not big data. Even Facebook and Google break up or segment their data into workable pieces. Data science is big, small, structured, unstructured, messy, clean, etc… It’s more than just analytics. As a data scientist, you’ll become a liaison between the IT department and the C suite. You have to talk both languages and you have to understand the hierarchy of data; you can’t be just an architect or data expert.
What really matters in data science is the team effort and your role as a liaison. Your company has large amounts of data and you want to make sure your queries are correct. Whatever tool you use, make sure you have your data cleansed. You want to know that it’s normalized and indexed so that things run smoother. You want to be able to give insight, which requires knowledge of your audience. If your audience is the C suite of a multi-million dollar company, you’re going to need everything you have to back up your conclusions. Be able to prove it and be prepared for questions.
What sort of personality makes for an effective data scientist?
Definitely curiosity. I remember in college, my professors shut the door if they saw me coming because telling me that a² + b² = c² was never enough. I wanted to know why. So the biggest question in data science is “why?” Why is this happening? If you notice that there’s a pattern, ask “why?” Is there something wrong with the data or is this an actual pattern going on? Can we conclude anything from this pattern? A natural curiosity will definitely give you a good foundation.
For aspiring data scientists, where can they begin?
There are many positions you can get into to learn data science; it’s not just for data engineers. Personally, I started as a junior analyst. Everyone has to start at the ground floor, but there are so many resources and open data sources where you can go to practice. Most IT departments aren’t going to give you access to their live database, but they may give you access to their development database where you can go in and practice. Whatever position you get into, go tell your boss that you’re interested in becoming a data scientist. Sign up for courses, learn programming languages and learn business. You have to know about budgets and various business aspects, not just the analysis part and not just the IT part. Data science is a wonderful field, and I encourage anyone who has a curiosity about data analysis, hypothesizing, or statistics to give it a shot. Just know that it won’t happen overnight.
Carla Gentry is the owner and chief data scientist for Analytical Solution. Analytical Solution was founded with the aim of aiding companies without their own designated analytics department, who need analysts on a per-project contract.