Big Data technology is new to most organizations, and so is awareness of the skills needed to get the best out of it. Expecting to have these skills overnight is wishful thinking. As a result, most organizations need to learn or recruit a large share of their Big Data skills, and usually do a little of both. Big volumes of data beg for analysis in order to glean correlations and inferences and to prove or disprove hypotheses. These methods point straight to Data Science. In the past, Data Science was practiced mostly in the academic world. Now, in order to be competitive in the marketplace, every business is expected to possess these academic skills, with one big difference: in academia, results did not need to be obtained quickly when the problems and the data were very complex. Academics could take their time, something businesses cannot afford to do; Time to Results is of paramount importance for a business to succeed.

Besides volume, then, the bigger problem is speed: the velocity with which the data arrives, with which it must be worked on, and with which the insights must reach the decision makers. Not only has the standard of “how much data” changed; the standard of “how soon” has changed dramatically as well.

Analysts of Big Data should have the following strengths:

  • Familiarity with newer statistical languages like R
  • Understanding and use of analytics modeling techniques
  • Outstanding familiarity with the data to be analyzed
  • Risk-taking mentality to experiment with data

Technical skills needed are, among others:

  • Very good understanding of, and experience with, open source software
  • Data architecture for databases that hold terabytes of data and grow every minute
  • Experience managing software frameworks like Hadoop; expertise in NoSQL databases such as Cassandra and HBase
  • Expertise with analytics programming languages and facilities such as R and Pig
  • Ability to manage hardware with hundreds or thousands of “small” CPUs, for multiple terabytes of data

Soft skills that have little to do with Big Data itself are also needed in many organizations:

  • Understanding of the “ins and outs” of the business
  • Understanding of the “bottom line” of the business
  • Ability to discern which analytics will answer the bottom-line questions
  • Communications skills to explain the analytics results
  • Understanding not only transactions but also interactions and observations

10 Skills of Big Data

Skill 1. Open Source: Apache Hadoop
Big Data processing software has to be able to disperse the data in “chunks” to a number of processors and reassemble it without losing anything in the process. The Hadoop platform is powerful, but its distributed storage and processing architecture makes it a beast that requires tender loving care and appropriate feeding by skilled technicians. Skills with the Hadoop stack, such as HDFS, MapReduce, Flume, Oozie, Hive, Pig, HBase, and YARN, are in high demand in the industry.
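
The disperse-and-reassemble idea can be sketched in a few lines of plain Python. This is only a toy illustration of the map and reduce steps, not Hadoop's API: the function names are made up for this example, and threads stand in for the separate machines of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Disperse the records into fixed-size chunks, keeping their order."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(records, chunk_size=2, workers=4):
    """Count words by mapping over chunks in parallel, then reducing."""
    chunks = split_into_chunks(records, chunk_size)
    # Threads are an assumption of this sketch; a cluster would use many machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    lines = ["big data is big", "data beats opinion", "big clusters need care"]
    print(word_count(lines).most_common(2))
```

Because each chunk is counted on its own, the work can be spread over as many workers as there are chunks, and the reduce step guarantees that nothing is lost when the partial results are put back together.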

Skill 2. Open Source: Apache Spark, an Alternative to MapReduce
In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s multi-stage in-memory primitives provide performance up to 100 times faster for certain applications by allowing user programs to load data into a cluster’s memory and query it repeatedly. Spark can be used either within a Hadoop framework or outside it. Spark requires technical expertise to program and run.
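
Why loading data once and querying it repeatedly helps can be shown with a toy comparison in plain Python. This sketch does not use Spark's API; the class names and the sample data are hypothetical, and re-parsing a string stands in for re-reading data from disk.

```python
import csv
import io

# Hypothetical sample data standing in for a file stored on a cluster.
RAW_CSV = "region,amount\nnorth,120\nsouth,80\nnorth,50\neast,200\n"


class DiskBackedQuery:
    """Re-parses the raw source for every query, like a disk-based job."""

    def __init__(self, raw):
        self.raw = raw
        self.loads = 0  # how many times the source was parsed

    def _load(self):
        self.loads += 1
        return list(csv.DictReader(io.StringIO(self.raw)))

    def total_for(self, region):
        return sum(int(r["amount"]) for r in self._load() if r["region"] == region)


class InMemoryQuery(DiskBackedQuery):
    """Parses the source once and keeps the rows in memory for later queries."""

    def __init__(self, raw):
        super().__init__(raw)
        self._rows = None

    def _load(self):
        if self._rows is None:
            self._rows = super()._load()
        return self._rows


if __name__ == "__main__":
    for engine in (DiskBackedQuery(RAW_CSV), InMemoryQuery(RAW_CSV)):
        totals = [engine.total_for(r) for r in ("north", "south", "east")]
        print(type(engine).__name__, totals, "loads:", engine.loads)
```

Both engines return the same answers; the difference is that the in-memory version pays the loading cost once, which is where the advantage for repeated or iterative queries comes from.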

Skill 3. Some More Technologies: Python, Data Lake, NoSQL

  • Python
    Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. Python supports multiple programming paradigms, including object-oriented, imperative, functional, and procedural styles.
  • Data Lake
    A Data Lake is a large storage repository that “holds data until it is needed.” The term was coined by James Dixon, then chief technology officer of Pentaho.
  • NoSQL
    A NoSQL (originally referring to “non-SQL” or “nonrelational”) database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, simpler “horizontal” scaling to clusters of machines (which is a problem for relational databases), and finer control over availability.
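
Two of the NoSQL ideas above, schema-free documents and “horizontal” scaling, can be sketched in plain Python. The class below is a hypothetical toy, not the API of Cassandra, HBase, or any other product: each dictionary stands in for one machine, and a hash of the key decides which machine holds a document.

```python
import hashlib


class ShardedDocumentStore:
    """Toy key-document store that spreads documents over several shards."""

    def __init__(self, shard_count=3):
        # Each dict stands in for one machine in a cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash, so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key):
        return self._shard_for(key).get(key)

    def find(self, field, value):
        """Scan every shard for documents where `field` equals `value`."""
        return [doc for shard in self.shards for doc in shard.values()
                if doc.get(field) == value]


if __name__ == "__main__":
    store = ShardedDocumentStore()
    store.put("user:1", {"name": "Ana", "city": "Lisbon"})
    # An extra field needs no schema change.
    store.put("user:2", {"name": "Raj", "city": "Pune", "tags": ["vip"]})
    store.put("user:3", {"name": "Lee", "city": "Lisbon"})
    print(store.get("user:2"))
    print(sorted(doc["name"] for doc in store.find("city", "Lisbon")))
```

Lookups by key touch a single shard, which is what makes this layout easy to spread over many machines; the price is that a query on any other field has to scan every shard.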

Skill 4. SQL
This is the old faithful of query languages, almost 40 years old. It has been resurrected after a lull in the relational world. NoSQL is used in the more complex environment of humongous data, while SQL is used for simpler, “no-brainer” applications. And because of the impetus of projects such as Cloudera’s Impala, SQL is close to becoming the lingua franca for the next generation of Hadoop-scale Data Warehouses.
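
As a small illustration of why SQL remains attractive, the sketch below runs one declarative query that groups, aggregates, and sorts. It uses Python's built-in `sqlite3` module with an in-memory database; the `sales` table and its rows are hypothetical sample data.

```python
import sqlite3


def regional_totals(sales):
    """Return (region, total) rows, largest total first, using plain SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", sales)
        # One declarative statement groups, aggregates, and sorts.
        query = (
            "SELECT region, SUM(amount) AS total "
            "FROM sales GROUP BY region ORDER BY total DESC"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("north", 120), ("south", 80), ("north", 50), ("east", 200)]
    for region, total in regional_totals(sample):
        print(region, total)
```

The query states what result is wanted and leaves the how to the engine, which is the same property that SQL-on-Hadoop engines rely on at much larger scale.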

Skill 5. General-Purpose Programming Languages: Java, C, Python, Scala
A general-purpose programming language is one designed for writing software in a wide variety of application domains. Languages such as Java, C, Python, and Scala are very useful for a person with an analytics background, and computer programmers with data analytics backgrounds are in high demand.

Skill 6. Data Mining and Machine Learning

  • Data Mining
    Data Mining is the computational process of discovering patterns in large data sets (“Big Data”) using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. It is the analysis of data with the intent to discover gems of hidden information in the vast quantity of data captured in the normal course of running the business.
  • Machine Learning
    Machine learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.
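
The phrase “learn from and make predictions on data” can be made concrete with one of the simplest learning algorithms, a straight-line fit by ordinary least squares. This is a minimal sketch in plain Python; the function names and the sample numbers are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Learn slope and intercept from examples by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(model, x):
    """Apply a learned (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical training data: advertising spend versus units sold.
    spend = [1, 2, 3, 4]
    sold = [3, 5, 7, 9]
    model = fit_line(spend, sold)
    print("model:", model)
    print("prediction for spend=6:", predict(model, 6))
```

The model is never told the rule behind the numbers; it estimates the rule from the examples and then applies it to an input it has not seen.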

Skill 7. Statistical and Quantitative Analysis
This is the crux of what Big Data is all about, and its main purpose. Quantitative analysis is the analysis of a situation or event by means of complex mathematical and statistical modeling; it is the translation of data into information and, in turn, into Predictive Insight. If you have a background in quantitative reasoning and a degree in a field like mathematics or statistics, you are already halfway there. If you have worked with the language R, or have used statistical software, you are a number of notches up. A quantitative background is a BIG plus.
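
A typical first step in such an analysis is measuring how strongly two quantities move together. The sketch below computes the Pearson correlation coefficient in plain Python; the function name and the sample figures are assumptions made for this example.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, -1 a perfect falling one."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two points")
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical figures: store visits and sales over five days.
    visits = [110, 150, 90, 200, 170]
    sales = [12, 18, 9, 25, 19]
    print(round(pearson_r(visits, sales), 3))
```

A value near zero says the two series have little linear relationship; a value near +1 or -1 is a hint worth investigating, though correlation alone does not establish cause.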

Skill 8. Data Visualization
Big Data can be very complex to comprehend if one is looking only at numbers and letters. Nothing compares to the comprehension the human brain achieves when your eyes see the “shape of your data.” A visual representation is an interface that presents information in an easy-to-understand and easy-to-relate, often graphical, way, providing users with a lot of meaningful information at a glance.
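
Even a crude picture makes the point. The sketch below turns a list of whole numbers into a text histogram using only the standard library; a real project would normally use a plotting tool, and the function name, bin width, and sample values here are assumptions for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width=10, bar_char="#"):
    """Return one line per bin of integer values, with a bar per bin count."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if not values:
        return []
    # Map each integer value to the start of its bin.
    counts = Counter((v // bin_width) * bin_width for v in values)
    first, last = min(counts), max(counts)
    lines = []
    # Walk every bin between the extremes so gaps in the data stay visible.
    for start in range(first, last + bin_width, bin_width):
        label = f"{start:>4}-{start + bin_width - 1:<4}"
        lines.append(f"{label} {bar_char * counts[start]}")
    return lines


if __name__ == "__main__":
    # Hypothetical response times in milliseconds.
    response_ms = [12, 15, 17, 31, 38, 5, 14, 33, 36, 39]
    print("\n".join(text_histogram(response_ms)))
```

The same ten numbers printed as a list say little; as bars, the two clusters and the gap between them are visible at a glance.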

Skill 9. Creativity
Creativity is a phenomenon whereby something new and somehow valuable is formed. No matter what software and hardware you use, in whichever industry, your brain is invaluable. The tools listed here will be replaced with other ones in a few years, but the human brain has developed over a few million years. The creative potential of our brain cells is monumental. Curiosity is the key to creativity, leading to new ways of looking at Big Data. Can you tell stories based on the data, and can you communicate them to the appropriate audience? Do you like data and like to play with it?

Skill 10. Problem Solving and Subject Matter Expertise

If you are equipped with subject matter expertise in a field such as health, finance, telecommunications, or retail, have the ability to think outside the box (to look at the data differently from the way everybody else is looking at it), are not afraid of swimming against the stream, and don’t take the path of least resistance out of convenience, you are the best candidate for Big Data projects.
