Originally posted as an answer to What are the most challenging problems you encounter in your work as a data scientist?
Politics, process, and organization are some of the most challenging problems I have encountered as a data scientist. This is especially true for well-established organizations that are new to data science. Established organizations are not always the best at innovating and changing (see The Innovator’s Dilemma for more information). This can hinder the implementation of a successful data science effort. Here are a few of the most challenging problems I have encountered.
Centralized Information Technology: In my humble opinion, data science is valuable because it combines technology, statistics, mathematics, and business knowledge into a single role. Combining all of these responsibilities into a single role/organization is often difficult for established organizations that centralize all IT functions into a single organization. The IT organization will often push back on data science requirements because they are outside the scope of their normal requirements and workflow. It is not normal for them to have a business user creating databases, running large-scale ETL jobs, or standing up servers.
Data ownership/governance: Data science usually involves gathering and combining data from multiple sources. This is difficult to accomplish when data is scattered across the organization and there are no clear rules governing who can access what data. Without such rules, data owners often make arbitrary decisions about who can and cannot access their data. They might be hesitant to share this data with someone whom they perceive as an outsider.
Vendor-driven decision making: I have seen many vendors who promote “data science in a box” solutions. These vendors will give polished presentations that make it seem like their product can do everything and anything. Problems arise when the company gets all of its information from the vendors and does not get feedback from the end-users of the solutions.
Too much focus on technology: Big companies have a tendency to adopt the latest and greatest trend without really understanding what it is all about. Service-Oriented Architecture (SOA) is an example of a trend that is great in principle but was frequently poorly implemented. Adoption of SOA often failed, because organizations focused on technology (web services, SOAP, REST, …) and not on the architecture and process needed for SOA to be successful.
The same holds true for data science. Data science is not about Hadoop, Spark, Tableau, or any other technology. These tools are a means to an end. A successful data science implementation often requires changes to processes, procedures, policy, and organization.
Boil the ocean approach: If you are not familiar with this phrase, “boiling the ocean” means trying to solve too many problems at once. I have also heard people use the phrases “trying to solve world hunger” and “trying to run before learning how to walk.” If an organization is new to data science, I recommend a small, focused pilot effort. The Lean Startup has some good advice on how to implement something like this in an existing organization.
Lack of buy-in: All of the problems I have described above will get worse if you do not have buy-in from the company’s leadership. Leaders need to understand why data science is valuable, what the organization is trying to achieve with it, and what needs to be done to achieve it.
I am sorry if you were looking for more technically oriented answers. I have come across many intellectually challenging problems in my career, but those problems are not the most difficult part of my job. Maybe it is because I enjoy the intellectual problems, but cannot always find ways to solve the people problems.