A Data Nerd's view on AI, Machine learning and the Internet of Things

The Foundation of Generative AI: Why Clean, Modernized Data is Essential

Clean, modernized data is essential for the success of generative AI, which offers exciting possibilities across industries. Like a sculptor needing quality clay, generative AI algorithms require reliable data for meaningful outputs. Research inspired by a Wall Street Journal quote highlights data quality's critical role. Generative AI models learn from training data, so inaccurate data leads to biased results ("garbage in, garbage out"). Clean data enables accurate pattern identification, high-quality content generation, and valuable insights. It's crucial for interpreting user queries and the reliability of synthetic data. Even cleaning a small data portion can significantly improve AI accuracy.

Unstructured data (text, images, etc.) poses challenges due to inconsistent formats, difficulty in validation and verification, lack of standardized metadata, cleaning complexities, and integration issues with structured data. Addressing these requires robust cleaning and modernization techniques.

Various tools aid in this: Data Quality Tools identify and fix issues; Data Integration Tools unify data from different sources; NLP Tools extract structured information from text; and ML Tools automate cleaning tasks. Key techniques include handling outliers and addressing missing data. Companies like Coca-Cola, Adidas, Dasa, and DaVita demonstrate the power of clean data in generative AI applications, including product innovation, knowledge management, diagnostics, and patient care. Generative AI also has potential in power systems for optimization and prediction.
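To make the two techniques named above (addressing missing data and handling outliers) a little more concrete, here is a minimal sketch in plain Python. The function names, the sample readings, the median fill, and the 1.5 × IQR rule are all illustrative assumptions, not a method prescribed by any of the tools or companies mentioned; real projects would pick the imputation and outlier rules that fit their data.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace None entries with the median of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one gap and one suspicious spike.
readings = [10, 12, None, 11, 13, 12, 11, 95]
cleaned = impute_missing(readings)
flagged = iqr_outliers(cleaned)
print(cleaned)
print(flagged)
```

Flagging is deliberately separate from removing: whether a spike is an error or a real event is a question for someone who knows the data.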

Using clean data improves accuracy, reduces bias, enhances efficiency, increases trust, and fosters innovation. Conversely, unclean data leads to inaccurate outputs, bias, reduced productivity, and eroded trust. Identifying data quality issues in generative AI is challenging, necessitating robust data observability. Investing in data quality is crucial for successful generative AI initiatives, leading to accurate insights and better decisions. As Soumya Seetharam of Corning stated, "Data cleanliness is a big deal." Prioritizing data cleaning and modernization is vital to leverage generative AI's transformative potential.

Being a “Data Scientist” Is As Much About IT As It Is Analysis by Carla Gentry, aka @Data_nerd

IBM defines the data scientist as follows: A data scientist represents an evolution from the business or data analyst role. Data scientists of today don’t just crunch numbers; they view the universe as one large data set and work to decipher relationships in the data.

The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings to both business and IT leaders in a way that can influence how an organization approaches a business challenge.

Good data scientists will not just address business problems, they will pick the right problems that have the most value to the organization. The data scientist role has been described as “part analyst, part artist.”

Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”…

A data scientist does not simply collect and report on data, but also looks at it from many angles, determines what it means, then recommends ways to apply the data.

Data scientists are inquisitive: exploring, asking questions, doing “what if” analysis, questioning existing assumptions and processes. Armed with data and analytical results, a top-tier data scientist will then communicate informed conclusions and recommendations across an organization’s leadership structure.

IBM hits the nail on the head with the above definition. Having worked with traditional data analysts as well as programmers, developers, architects, scrum masters, and data scientists — I can tell you they don’t all think alike. A data scientist could be a statistician but a statistician may not be completely ready to take on the role of data scientist, and the same goes for all the above titles as well.

Beth Schultz from All Analytics mentioned that we are like jacks of all trades but masters of none; I don’t completely agree with this comment, but do agree that my ETL skills are not as honed as my analysis skills, for example. My definition of the data scientist includes: knowledge of large databases and clones, slave, master, nodes, schemas, agile, scrum, data cleansing, ETL, SQL and other programming languages, presentation skills, Business Intelligence and Business Optimization, plus the ability to glean actionable insight from data. I could go on and on about what the data scientist needs to be familiar with, but the analysis part has to be mastered knowledge and not just general knowledge. If you want to separate the pretenders from the experienced in this business, ask a few questions about how data science actually works!

When I start working with a new data set (it doesn’t matter how much or what kind), the first question I usually ask is, what kind of servers do you own?

Why would you need to know about the servers to work with data? I ask this question so I will know what kind of load it can handle – is it going to take me 9 hours to process or 15 minutes? How many servers do you have? I ask this because if I have 4 or 5 servers, I can toggle or load balance versus having only 1 that I have to babysit.

What kind of environment will I be working in? I ask this because I need to know if they have a test environment versus a live environment, so I can play without crashing every server in the house and ticking a lot of people off. If you are working with lots of data, lower peak times or low load times are better for live, as compared to test or staging environments where you can “play” without fear. This way, you won’t “bring down the house”.

It’s a good idea for you Chief Marketing Officers (CMOs) to let your Data Scientist work in the evening hours and/or on weekends, at their homes if applicable. This, of course, requires setting up a VPN connection, and it also depends on how secure the data connections are, as well as how much processing I can do before I crash them (um, I mean, what is the speed and capacity to process?). If a dial-up connection is all that’s available, forget it.

As a side note, I’ve crashed many a server in my day – how do you think I learned all this stuff? Back in the Nineties, someone would crash the mainframe at RJKA and we would all head to Einstein’s Deli in Oak Park, IL but today, this might be frowned upon. But I digress, back to more IT related things.

Another handy thing to find out is how the databases are joined. By that I mean, what variables do they have in common (i.e., the primary and foreign keys)? Are the relationships one-to-one, one-to-many, or many-to-many? Why would you ask this? Some programmers (I don’t mean this in general) don’t completely understand relational databases, especially when it comes to transactional data and data that needs to be refreshed often. You have to set up a database like you would play chess: think at least three moves ahead.
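If nobody can tell you how two tables relate, you can check for yourself by counting how often each key value repeats on each side. The sketch below does that in plain Python; the `relationship` function and the customer/order key lists are hypothetical stand-ins for whatever key columns you pull from the real tables.

```python
from collections import Counter

def relationship(left_keys, right_keys):
    """Classify a join on a shared key by checking for repeated key values on each side."""
    left_repeats = any(n > 1 for n in Counter(left_keys).values())
    right_repeats = any(n > 1 for n in Counter(right_keys).values())
    if left_repeats and right_repeats:
        return "many-to-many"
    if left_repeats:
        return "many-to-one"
    if right_repeats:
        return "one-to-many"
    return "one-to-one"

# Hypothetical key columns: one row per customer, several orders per customer.
customer_ids = [1, 2, 3]
order_customer_ids = [1, 1, 2, 3, 3, 3]
print(relationship(customer_ids, order_customer_ids))
```

A many-to-many result is the one to watch for: joining on that key multiplies rows, which quietly inflates sums and counts downstream.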

Additionally, some programmers/developers use too many JOIN statements in their scripts, which causes a large number of iterations. Since these tend to increase run time and are not very efficient, you don’t want to be linking too many of these babies together and then running complex algorithms or scripts.

Sometimes, it’s better to start from scratch and build your own data source. When writing scripts to extract or refresh data, don’t forget a few key things: normalize, index, and pick your design based on what you know about the data and what is being requested of it.
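Here is a small sketch of what "normalize and index" can look like in practice, using Python's built-in `sqlite3` module and an in-memory database. The table names, column names, index name, and sample rows are all made up for illustration; your own design should follow from what you know about the data and what is being asked of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized layout: customer details live in one place, orders only carry the key.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
-- Index the column used for joins and lookups.
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0)],
)

# Ask SQLite how it plans to run the lookup; the last column describes each step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT order_id, amount FROM orders WHERE customer_id = ?",
    (1,),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

# A single join against the normalized tables.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(plan_text)
print(totals)
```

Checking the query plan before running a heavy script on a large table is a cheap way to find out whether you are in for 15 minutes or 9 hours.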

Servers are important, and if dealing with large databases, load balance or toggle whenever possible. Also, star schema versus snowflake schema is important, so please put some serious thought into this. Ask yourself, do I need it fast or efficient? Believe me, I always pick efficient (I am a nerd, after all) but if the client needs it ASAP, then the client shall have it ASAP.

With knowledge of the client’s IT setup from a data management/quality perspective, you’ll be equipped to handle most situations you run into when dealing with data, even if the Architect and Programmer are out sick. Your professional knowledge is going to be a big help in getting the assignment or job complete.

Happy data mining and please play with data responsibly!

About the Author

My client list is private, but during the past 20+ years I have worked with Fortune 100 and 500 companies including, but not limited to, Discover Financial Services, J&J, Hershey, Kraft, Kellogg’s, SCJ, McNeil, Firestone, PBA, Disney, Deloitte, Talent Analytics, Samtec, and more.

Acting as a liaison between the IT department and the Executive staff, I am able to take huge complicated databases, decipher business needs and come back with intelligence that quantifies spending, profit and trends. Being called a data nerd is a badge of courage for this curious Mathematician/Economist because knowledge is power and companies are now acknowledging its importance. To find out more about what I do, please visit my profile on LinkedIn

https://www.linkedin.com/in/datanerd13

“Big Data needs Data Science but Data Science doesn’t need Big Data” Carla Gentry aka @data_nerd

Data science has been around for decades, and it’s not just big data. I hear a lot of people clumping these two together like they go hand-in-hand, which I agree with to an extent. However, big data needs data science but data science doesn’t necessarily need big data. Most of the data a typical company handles on a daily basis or houses internally is not big data. Even Facebook and Google break up or segment their data into workable pieces. Data science is big, small, structured, unstructured, messy, clean, etc… It’s more than just analytics. As a data scientist, you’ll become a liaison between the IT department and the C suite. You have to talk both languages and you have to understand the hierarchy of data; you can’t be just an architect or data expert.

What really matters in data science is the team effort and your role as a liaison. Your company has large amounts of data and you want to make sure your queries are correct. Whatever tool you use, make sure you have your data cleansed. You want to know that it’s normalized and indexed so that things run smoother. You want to be able to give insight, which requires knowledge of your audience. If your audience is the C suite of a multi-million dollar company, you’re going to need everything you have to back up your conclusions. Be able to prove it and be prepared for questions.

What sort of personality makes for an effective data scientist?

Definitely curiosity. I remember in college, my professors shut the door if they saw me coming because telling me that a² + b² = c² was never enough. I wanted to know why. So the biggest question in data science is “why?” Why is this happening? If you notice that there’s a pattern, ask “why?” Is there something wrong with the data or is this an actual pattern going on? Can we conclude anything from this pattern? A natural curiosity will definitely give you a good foundation.

For aspiring data scientists, where can they begin?

There are many positions you can get into to learn data science; it’s not just for data engineers. Personally, I started as a junior analyst. Everyone has to start at the ground floor but there are so many resources and open-source data places you can go to practice. Most IT departments aren’t going to give you access to their live database, but they may give you access to their development database where you can go in and practice. Any position that you get into, go tell your boss that you’re interested in becoming a data scientist. Sign up for courses, learn programming languages and learn business. You have to know about budgets and various business aspects, not just the analysis part and not just the IT part. Data science is a wonderful field, and I encourage anyone that has a curiosity about data analysis, hypothesizing, statistics, to give it a shot. Just know that it won’t happen overnight.

Carla Gentry is the owner and chief data scientist for Analytical Solution. Analytical Solution was founded with the aim of aiding companies without their own designated analytics department, who need analysts on a per-project contract.

Machine learning can make patterns evident but only if the data used is clean, normalized and complete. Natural language processing (NLP) is a critical part of obtaining data from documents and notes or chats. Getting natural language is one thing. Knowing what to do with it is another.

To get truly natural responses, we’ll need to improve your machine learning. With exposure to enough conversations, the computer on the other end of the line should gradually learn what the correct response should be. But that’s where the disconnect comes in: most people believe you can train a machine in days, when in reality it takes months or even years for machines to completely learn their tasks.

The same applies to artificial intelligence; everyone is singing its praises without understanding what is TRULY involved in a successful implementation. Let’s briefly discuss some of the obstacles that are ignored, starting with success rates: according to IDG research, 96% of organizations are hindered by data challenges and 80% experience reduced productivity as a result of technology gaps, leadership failures, lack of strategy, confusion about ownership of data, and lack of experience.

Data issues that arise with AI are: siloed (aka disparate) data, technology complexity or legacy systems, lack of schema or metadata, lack of access to data, lack of ability to process the big data needed for AI to work, and finally lack of business buy-in and talent/experience in the field.

Artificial Intelligence needs data to learn, so all the above issues are showstoppers for sure if you aren’t prepared. As for the lack of talent, we experienced data nerds can mentor until we are blue in the face, but practical applications and hands-on experience are still the best ways to learn.

Abdul Razack, senior VP and head of platforms at Infosys, notes that another way to develop AI expertise is to "take a statistical programmer and training them in data strategy, or teach more statistics to someone skilled in data processing." Mathematical knowledge is foundational, Terdoslavich adds: a "solid grasp of probability, statistics, linear algebra, mathematical optimization--is crucial for those who wish to develop their own algorithms or modify existing ones to fit specific purposes and constraints."

So remember, before you make promises you can’t keep: machine learning, AI, NLP, etc… all require good data, communication within the team creating or designing, system compatibility, solid logical programming and MATH… It’s not just a cool buzzword and something to add to your resume or website to be deemed relevant.

Written by a @data_nerd. For more, check me out on LinkedIn.

Women in Data

Cutting-Edge Practitioners and Their Views on Critical Skills, Background, and Education

Honored to be included with my peers!!!

Get the free ebook

Our new 2015 Edition of O'Reilly's Women in Data report reveals inspiring stories of success and insights from four women working in data, across the European Union. Now featuring a total of 19 interviews with women who are central to data businesses, authors Cornelia Lévy-Bencheton and Shannon Cutt uncover strategies for success for women in the field of data, and anyone interested in pursuing or advancing their career in data.

While women are still an underrepresented minority in the disciplines of science, technology, engineering, and math (STEM), women in data and technology are no longer outliers. With this report, you'll learn how a remarkable group of women in data achieved their current level of success, what motivated them to get there, and their views about opportunities for women in the field.

The stories in this book are inspiring, revealing insights that will widen the path for even more women in tech.

These interviews explore:

The expanding role of the contemporary data scientist
New attitudes towards women in data among Millennials
Benefits of the data and STEM fields as a career choice for women
Remedies for closing the gender gap

https://www.oreilly.com/data/free/women-in-data.csp

#WinWithAI - Influencers Perspective - TalkwithTman