With roots in statistical modeling and data analysis, data scientists have backgrounds in advanced math and statistics, advanced analytics, and increasingly machine learning and AI. The focus of data scientists is, unsurprisingly, data science: how to extract useful information from a sea of data, and how to translate business and scientific informational needs into the language of information and math. Data scientists need to be masters of the statistics, probability, mathematics, and algorithms that help glean useful insights from huge piles of information. They have usually learned programming out of necessity, in order to run programs and advanced analyses on data. As a result, the code that data scientists are typically tasked to write is minimal, only what is necessary to accomplish a data science task (R and Python are common languages for them to use), and they work best when they are provided clean data to run advanced analytics on. A data scientist is a scientist who creates hypotheses, runs tests and analyses on the data, and then translates the results into a form that someone else in the organization can easily view and understand.
On the other hand, data scientists can’t perform their jobs without access to large volumes of clean data. Extracting, cleaning, and moving data is not really the role of a data scientist, but rather that of a data engineer. Data engineers have programming and technology expertise, and have typically worked on data integration, middleware, analytics, business data portals, and extract-transform-load (ETL) operations. The data engineer’s center of gravity is big data and distributed systems, with experience in programming languages such as Java, Python, and Scala, as well as scripting tools and techniques. Data engineers are tasked with taking data from a wide range of systems, in structured and unstructured formats, that is usually not “clean”: it has missing fields, mismatched data types, and other data-related issues. They need to use their programming, integration, architecture, and systems skills to clean all the data and put it into a format and system that data scientists can then use to analyze it, build their data models, and provide value to the organization. In this way, a data engineer is an engineer who designs, builds, and arranges data.
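
To make the cleaning work described above a little more concrete, here is a minimal Python sketch of handling missing fields and mismatched data types. The record layout (`customer_id`, `age`, `country`), the sample rows, and the rule of dropping rows without an ID are all assumptions made up for illustration. Real pipelines would more likely use a distributed framework than plain dictionaries.

```python
from typing import Any, Optional

# Hypothetical raw records, as they might arrive from different source systems.
RAW_RECORDS = [
    {"customer_id": "101", "age": "34", "country": "US"},
    {"customer_id": 102, "age": None, "country": "de"},
    {"customer_id": "103", "age": "n/a"},   # missing "country"
    {"age": "51", "country": "FR"},         # missing "customer_id"
]


def to_int(value: Any) -> Optional[int]:
    """Coerce a value to int, returning None when it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def clean_record(raw: dict) -> Optional[dict]:
    """Return a record with consistent types, or None if it has no usable ID."""
    customer_id = to_int(raw.get("customer_id"))
    if customer_id is None:
        return None  # assumption: rows without an ID are dropped
    country = raw.get("country")
    return {
        "customer_id": customer_id,
        "age": to_int(raw.get("age")),  # unparseable ages become None
        "country": country.upper() if isinstance(country, str) else None,
    }


def clean_all(records: list) -> list:
    """Clean every record and keep only the ones that survived."""
    cleaned = (clean_record(r) for r in records)
    return [r for r in cleaned if r is not None]


if __name__ == "__main__":
    for row in clean_all(RAW_RECORDS):
        print(row)
```

After this step every surviving row has the same fields with the same types, and missing values are marked explicitly as `None`. That is the kind of predictable input a data scientist can analyze directly.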