Big Data – reduced to a buzzword



A “buzzword”: that is what data has been reduced to. “Big Data” is now a common phrase used to describe any number of different types of data: social media data, point-of-sale data, financial data, digital and visual data… Argh, make it stop. But what is it, really, and what makes it useful rather than noise?

Over the course of my career, I have worked for companies of all sizes, some handling data better than others; the best was actually one of the smallest (go figure). Most companies struggled to figure out what to do with the data they already had, as opposed to how to get more. Retail and CPG companies that can afford all the latest and greatest BI and data mining tools usually collect and use their data very well, since it really is their bread and butter and without it the competition would eat them alive. Unfortunately, they usually aren’t able to “house” the data, which makes “real time” almost impossible. Smaller companies that have jumped into the data pool (per se) purchase large amounts of data or gather their own live data, but rarely have the insight to know what “is” or “isn’t” important.

For example, say I sell a 250 dollar yard trimmer. Now, 250 bucks is a bit steep, so I know the average person is not going to buy it. I would need someone who owns their home (the norm) or rents and really cares about the yard (the outlier), and someone who makes an above-average income (the norm) or who saved up to make the purchase (since it’s a yard trimmer, we’ll call this an outlier). But I only have a name and an email address; what can I do? Honestly, not a whole lot, except maybe build a mailing list. Say I have name, complete address and email: a little better. You could use the addresses to overlay federal, state and local data, or census data for that neighborhood. That would tell you median income, average home price, etc., but without more demographic and financial data it would still not be enough to deduce much insight. So the kind of data you collect is more important than ever: if you want to target your customers, think about what it would really take to get the best insight.
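
To make the address overlay idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the customer list, the ZIP codes, the neighborhood figures (made up, not real census numbers), the 75,000 dollar cutoff, and the helper names `overlay` and `likely_buyers`. It only shows the general shape of joining your own records to neighborhood-level data.

```python
# Hypothetical customer list: name, email, and the ZIP code from the full address.
customers = [
    {"name": "Ann", "email": "ann@example.com", "zip": "11111"},
    {"name": "Bob", "email": "bob@example.com", "zip": "22222"},
    {"name": "Cy", "email": "cy@example.com", "zip": "33333"},
]

# Made-up neighborhood figures standing in for census data, keyed by ZIP code.
census_by_zip = {
    "11111": {"median_income": 48000, "median_home_price": 150000},
    "22222": {"median_income": 95000, "median_home_price": 410000},
}


def overlay(customers, census_by_zip):
    """Attach neighborhood-level figures to each customer (None when no match)."""
    enriched = []
    for customer in customers:
        area = census_by_zip.get(customer["zip"])
        row = dict(customer)
        row["median_income"] = area["median_income"] if area else None
        row["median_home_price"] = area["median_home_price"] if area else None
        enriched.append(row)
    return enriched


def likely_buyers(enriched, min_income):
    """Keep customers whose neighborhood median income meets the cutoff."""
    return [
        row for row in enriched
        if row["median_income"] is not None and row["median_income"] >= min_income
    ]


enriched = overlay(customers, census_by_zip)
targets = likely_buyers(enriched, min_income=75000)
print([t["name"] for t in targets])
```

Note that this only tells you about the neighborhood, not the person, which is exactly why it still falls short of real insight.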

Next issue: when working with data, you need to think about its quality. What do I mean by that? Is the data accurate and clean? Take a look at the number of duplicate rows and of incomplete or N/A data fields; these are very important to note and act on. Next is how your data is labeled and defined: the “metadata” or data dictionary of your database. It tells you whether a data field is character or numeric, its length (often capped at 255 characters, so watch out for those “NOTE” sections), and, if applicable, a short description of what the variable actually is.

A unique identifier is preferred. When working with FICA/FICO, we used SSN, but in other cases a client ID or purchase ID, which may not be unique, is usually used. If multiple purchases or visits occur under a non-unique label, it can be a headache, especially when working with live data and adding it to the master database. Updates in a data warehouse involve data dumps or extraction, transformation and load (ETL) to merge new data with existing data (segmentation is based on some type of identifier, hopefully a unique variable). Sounds easy (not), but it gets worse: the bigger the data, the longer this process takes, and we haven’t even started talking about unstructured data yet. Whew.

How are incomplete rows beneficial? If you are looking at web data or basket sales, they can show you where someone abandoned a shopping cart; if it’s a loan application, they can tell you where the applicant stopped. See where I’m headed? Data entry is VERY important: a few fat-fingered entries add up fast when you are talking terabytes of data, especially when they are keyed in by a multitude of people.
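
As a rough illustration of the quality checks described above (duplicate rows, incomplete or N/A fields, and IDs that are not actually unique), here is a small Python sketch. The sample rows, the field names, the set of values treated as “missing”, and the `quality_report` helper are all my own assumptions; a real database would need its own rules.

```python
from collections import Counter

# Values treated as "missing" here; this set is an assumption, adjust for your data.
MISSING = {"", "N/A", None}

# Hypothetical rows, as they might come out of a customer table.
rows = [
    {"client_id": "C1", "email": "a@example.com", "zip": "11111"},
    {"client_id": "C1", "email": "a@example.com", "zip": "11111"},  # exact duplicate
    {"client_id": "C2", "email": "N/A", "zip": "22222"},            # incomplete
    {"client_id": "C2", "email": "b@example.com", "zip": ""},       # same ID, different row
    {"client_id": "C3", "email": "c@example.com", "zip": "33333"},
]


def quality_report(rows, key):
    """Count exact duplicate rows, missing values per field, and reused keys."""
    # Turn each row into a hashable tuple so identical rows can be counted.
    row_counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicate_rows = sum(n - 1 for n in row_counts.values())

    missing_per_field = Counter(
        field for r in rows for field, value in r.items() if value in MISSING
    )

    # A key is "non-unique" when more than one distinct row shares it.
    key_counts = Counter(dict(row)[key] for row in row_counts)
    non_unique_keys = sorted(k for k, n in key_counts.items() if n > 1)

    return {
        "duplicate_rows": duplicate_rows,
        "missing_per_field": dict(missing_per_field),
        "non_unique_keys": non_unique_keys,
    }


report = quality_report(rows, key="client_id")
print(report)
```

The incomplete rows are counted rather than thrown away on purpose, since they can carry information of their own.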

There is more to data than meets the eye. Everyone wants it, but if you want it just for the sake of having data, make sure it’s not just noise. What do I mean by noise? Data experts take different stances on this one; I’m the “make a mental note, but remove it for the sake of immediate insight” kind of person (null data does not make a pretty spreadsheet). I take special note at the end of the evaluation or data analysis, but I don’t freak out trying to figure out why I have 87 records indicating that the person was over 90 years old or made 123 dollars a year. Mis-entries, errors, fat fingers… no time for them now, but I will contact IT to correct the records later (this part is very important as well: if they are not corrected, that is 87 wasted records, and they keep coming up with each analysis).
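
Here is one way that “set it aside now, fix it later” habit might look in Python. This is only a sketch under my own assumptions: the sample records, the field names, the `split_noise` helper, and the income floor of 1,000 dollars are all hypothetical (the age cutoff of 90 comes from the example above). The flagged records are kept, with a reason attached, so they can be handed to IT rather than silently dropped.

```python
# Hypothetical records; in practice these would come from your database.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 97, "income": 61000},  # implausible age
    {"id": 3, "age": 45, "income": 123},    # implausible yearly income
    {"id": 4, "age": 58, "income": 78000},
]


def split_noise(records, max_age=90, min_income=1000):
    """Separate plausible records from suspicious ones without deleting anything."""
    clean, flagged = [], []
    for record in records:
        reasons = []
        if record["age"] > max_age:
            reasons.append(f"age over {max_age}")
        if record["income"] < min_income:
            reasons.append(f"income under {min_income}")
        if reasons:
            # Keep the record and note why it was set aside, for later correction.
            flagged.append({**record, "reasons": reasons})
        else:
            clean.append(record)
    return clean, flagged


clean, flagged = split_noise(records)
print("analyze now:", [r["id"] for r in clean])
print("send to IT:", [(r["id"], r["reasons"]) for r in flagged])
```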

Unstructured data: what do I mean by unstructured? All data has some type of structure… yes, but take Twitter and Facebook data. It doesn’t fit into a tabular form or model, but if you manipulate it (using whatever method you choose) you can still infer insight. It’s messy, though, and often full of useless information, e.g. “Joe ate a sandwich and boy was it good, giggle.”

Lots to think about: tools for collecting data, tools for extracting and updating data, tools for converting unstructured data into usable information, and the talent to glean insight from data. Storage used to be a big deal; now a terabyte is 50 dollars, but a data warehouse or data mart will require multiple servers or a mainframe, and now there’s some money. But this is enough for you to think about for now. Do you still want to build that database or start a data warehouse? If so, please don’t shrug it off as being a piece of cake. To gain insight, the correct steps are to think first, collect second. Happy Mining!