data science is the future

Reasons To Use Hadoop For Data Science

Here you can learn why hadoop for data science. Over the past 10 years, large web companies such as Google, Yahoo !, Amazon and Facebook have successfully applied large-scale learning algorithms on large data sets, creating innovative data products such as systems Online advertising and recommendation engines.

Note : Click here to know about difference between data scientist and data analytics

Apache Hadoop is fast becoming a central store for large data in the enterprise and is therefore a natural platform with which enterprise computing can now apply data science to a variety of business issues such as product recommendation , Fraud detection and sentiment analysis


Data researchers love their work environment. Whether using R, SAS, Matlab or Python, they still need a laptop with plenty of memory to analyze data and build templates. In the world of large data, portable memory is never enough, and sometimes not even close.

A common approach is to use a sample of the large dataset, a large sample that can match the memory. With Hadoop, you can now perform many exploratory data analysis tasks on complete datasets without sampling. Just write a map to reduce the work, PIG or the HIVE script, run it directly on Hadoop on the complete dataset, and get the results right at your laptop.


In many cases, machine learning algorithms perform best when they have more data to learn, especially for techniques such as clustering, aberration detection, and product recommendations.

Historically, large datasets were not available or too expensive to acquire and store, and automated learning practitioners therefore needed to find innovative ways to improve models with fairly limited data sets. With Hadoop as a platform that provides a linearly scalable storage and processing capacity, you can now store ALL data in RAW format and use the complete dataset to build more accurate and accurate models.


As many data scientists will tell you, 80% of the scientific work data is typically with data acquisition, processing, cleaning and feature extraction. This “preprocessing” step transforms the raw data into a consumable format by the machine learning algorithm, typically in the form of a matrix of characteristics.

Hadoop is an ideal platform for implementing this type of pre-processing efficiently and spread over large datasets, using reduction-map or tools like PIG, HIVE and scripting languages ‚Äč‚Äčlike Python. For example, if your application involves a word processor, it is often necessary to represent data in word-vector format using TFIDF, which involves counting word frequencies on a large corpus of documents, ideal for Batch reduction.

Similarly, if your application requires joining large tables with billions of rows to create feature vectors for each data object, the bacterium or pig is very useful and efficient for this task.


It is often mentioned that Hadoop is “schema on read” as opposed to most traditional RDBMS systems that require a strict schema definition before any data can be ingeted into them.

“Schema on read” creates “data agility”: when a new data field is needed, it is not necessary to go through a long project of redesigning schema and database migration in production, To last for months. The positive impact has an impact on an organization and very quickly everyone wants to use Hadoop for its project, get the same level of agility and gain a competitive advantage for its activities and product lines.


Source : hortonworks

Leave a Reply

Your email address will not be published. Required fields are marked *