Big data and machine learning – definition, importance, differences

Abstract

The purpose of this survey paper is to define Big data, to explain how it differs from traditional data sets, and to describe the purpose it serves, its defining characteristics, and the issues and challenges it raises. Machine learning, one of the technologies that uses Big data, is then explored, and two machine learning techniques, k-means and SVM, are studied and compared.

Keywords: Big data, k-means, SVM, Machine learning.

Introduction

The term big data, though coined in the 1990s, has been a buzzword for the last decade, and many large corporations and tech giants are investing in it and trying to develop new technologies for it. In 2012, six U.S. federal departments and agencies (the National Science Foundation, NIH, the U.S. Geological Survey, DOD, DOE and the Defense Advanced Research Projects Agency) announced a joint research and development initiative that would invest more than $200 million to develop new big data tools and techniques.

So, what is Big data?

Big data, as the term suggests, is about dealing with large amounts of data. Almost everything in this world generates data. Big organizations are trying to collect this data to study and understand the behavior of the masses, climate and weather, the genome, and much more. Many big companies collect and possess large amounts of data that are too voluminous or too unstructured to be analyzed or processed using traditional data management methods. This burgeoning volume of data is collected from social media, online activity, sensors, videos, surveillance cameras, voice recordings from calls, GPS data and many other sources.

The impact of Big data can be seen all around us, for example when Google predicts the term you are about to search for or Amazon suggests a product for you. All of this is done by gathering, studying and analyzing the large amounts of data that all of us generate.

What makes Big data so important?

A simple answer is that data-driven decisions are much better than decisions driven by intuition, and Big data makes such decisions possible. Companies already collect a great deal of data; if they can find and understand the patterns in it, their managerial decisions can become much more effective. It is the potential of Big data to support predictive analysis that has drawn so much attention to it.

Issues and Challenges

There are three data types categorized in Big data:

  1. Structured data: more traditional data, such as relational tables.
  2. Semi-structured data: HTML, XML.
  3. Unstructured data: video data, audio data.

This is where the problem arises. Traditional data management techniques can process structured data and, to some extent, semi-structured data, but they cannot process unstructured data. That is why traditional data management techniques cannot be used efficiently on Big data.

Relational databases are more suitable for structured data that is transactional in nature. They satisfy the ACID properties. ACID is an acronym for:

  • Atomicity: A transaction is “all or nothing” when it is atomic. If any part of the transaction or the underlying system fails, the entire transaction fails.
  • Consistency: Only transactions with valid data will be performed on the database. If the data is corrupt or improper, the transaction will not complete and the data will not be written to the database.
  • Isolation: Multiple, simultaneous transactions will not interfere with each other. All valid transactions will execute until completed and in the order they were submitted for processing.
  • Durability: After the data from the transaction is written to the database, it stays there “forever.”

Relational databases cannot easily achieve ACID guarantees at the scale and variety of Big data.
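The atomicity and consistency properties listed above can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the "no negative balance" rule are hypothetical and exist only for illustration; this shows the behavior on a small single-node database, not on a Big data system.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Hypothetical consistency rule: no account may go negative.
            (lowest,) = conn.execute("SELECT MIN(balance) FROM accounts").fetchone()
            if lowest < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok_small = transfer(conn, "alice", "bob", 30)   # valid: both updates are kept
ok_large = transfer(conn, "alice", "bob", 500)  # invalid: both updates are undone
final = balances(conn)
```

After the failed second transfer, neither of its two updates is visible: the balances are exactly what the first, valid transfer left behind.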

Characteristics of Big data

Size is the first thing that comes to mind when we talk about Big data, but it is not its only characteristic. Big data is characterized by three V’s: volume, velocity and variety. These are what differentiate Big data from being just another form of “analytics”.

Volume

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s. With the world going digital, as of 2012 about 2.5 exabytes (2.5 × 10^18 bytes) of data were being created every day. This abundance gives companies the opportunity to work with petabytes of data in a single data set. Google alone processes 24 petabytes of data every single day. It is not just online data: Walmart collects around 2.5 petabytes of data every hour from its customer transactions.
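To give a feel for these figures, the short sketch below works through the arithmetic. It assumes steady doubling every 40 months and a constant hourly rate, which are simplifications of the numbers quoted above; the function and variable names are illustrative only.

```python
MONTHS_PER_DOUBLING = 40  # doubling period quoted in the text

def growth_factor(years, months_per_doubling=MONTHS_PER_DOUBLING):
    """Factor by which capacity grows after `years` of steady doubling."""
    return 2 ** (years * 12 / months_per_doubling)

# Ten years is exactly three doubling periods (120 / 40), so capacity grows 8x.
ten_year_factor = growth_factor(10)

# Walmart figure from the text: about 2.5 petabytes per hour, scaled to a day.
walmart_pb_per_day = 2.5 * 24
```

Under these assumptions, storage capacity grows eightfold per decade, and the Walmart figure corresponds to roughly 60 petabytes per day.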

Velocity

The speed of data creation, processing and retrieval is important. To make real-time or near-real-time predictions, speed is a necessary factor. Even milliseconds of data latency can put companies behind their competitors. Rapid analysis gives an obvious advantage to Wall Street firms and Main Street managers alike.

Variety

The sources of data are very diverse. For example, the data collected by social media platforms includes pictures, videos, the pages on which a user spent the most time, the user’s entire social media activity, and the trends that most users are leaning towards. And that is just one example: there can also be sensors collecting different types of data, from temperature readings to pictures and videos of samples. The data types vary from structured to semi-structured to unstructured.

Literature Review

Big data as a decision-making and predictive analytics tool is defined and reviewed by Davenport, Thomas H., Paul Barth, and Randy Bean in “How ‘big data’ is different.”

Machine learning is one of the technologies that uses big data. It learns via different methods such as supervised learning, unsupervised learning and reinforcement learning. Unsupervised learning can use an algorithm called k-means, which is explained in “k-means++: The advantages of careful seeding” by Arthur, David, and Sergei Vassilvitskii. Supervised learning uses many algorithms, which are discussed in “Performance analysis of various supervised algorithms on big data” by Unnikrishnan, Athira, Uma Narayanan, and Shelbi Joseph.

In “Predict failures in production lines: A two-stage approach with clustering and supervised learning,” D. Zhang, B. Xu and J. Wood take unlabeled data, use k-means to cluster it, and then pass the clusters to supervised learning algorithms to predict failures in a car manufacturing production line.

Comparative Study

As reported by the McKinsey Global Institute in 2011, the main components and ecosystem of Big data are as follows:

  • Techniques for analyzing data: A/B testing, machine learning and natural language processing.
  • Big data technologies: business intelligence, cloud computing and databases.
  • Visualization: charts, graphs and other displays of the data.

In this survey paper we study two different algorithms used in machine learning: k-means and SVM.

Machine Learning

Machine learning is one of the techniques used in Big data to analyze data and find patterns in heaps of it. This is how Amazon, YouTube and other websites show predictions or related products to their users.

Three types of learning algorithms are used in machine learning:

  1. Supervised learning: the algorithm develops a mathematical model from a given set of labeled training data. Each training example has inputs and a desired output. Supervised algorithms include classification algorithms and regression algorithms. Classification algorithms are used when the desired output is a discrete label (a class). Regression algorithms are used when the output is expected to be a value within a continuous range.
  2. Unsupervised learning: the algorithm takes data that is not labeled, classified or organized. It learns the commonalities in the given data and reacts to new data based on the presence or absence of those commonalities. Unsupervised learning commonly uses clustering. Some common clustering algorithms are:
    • K-means
    • Mixture models
    • Hierarchical clustering
    • OPTICS algorithm
    • DBSCAN
  3. Reinforcement learning: an agent learns how to behave by interacting with its environment and observing the results. It is used in fields such as game theory and control theory, and in systems such as those built by DeepMind.

K-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It is used to automatically partition a given data set into k groups. It works as follows:

  1. It starts by selecting k initial random centers, called means.
  2. It assigns each data point to its closest mean, and then recomputes each mean as the average of all the points assigned to it.
  3. Step 2 is repeated for a given number of iterations, or until the assignments stop changing, to give the final clusters.

The outcome may not be optimal. Selecting different initial means and running the algorithm again may yield better clusters.

This is an unsupervised learning method for categorizing unlabeled data and making decisions based on the resulting categories.
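The steps above, together with the idea of rerunning from different starting means, can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the toy points and the choice of ten restarts are assumptions made for the example, and the initial means are drawn at random from the data rather than with the careful seeding of k-means++.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points]

def kmeans(points, k, iterations=20, seed=0):
    """Return (centers, labels, cost) for one run of k-means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: random initial means
    for _ in range(iterations):              # step 3: repeat
        labels = assign(points, centers)     # step 2: closest mean per point...
        new_centers = []
        for c in range(k):                   # ...then recompute every mean
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's old mean
        if new_centers == centers:           # means stopped moving
            break
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(squared_distance(p, centers[label])
               for p, label in zip(points, labels))
    return centers, labels, cost

def best_of(points, k, restarts=10):
    """Rerun k-means from different random starts and keep the lowest cost."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: run[2])

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels, cost = best_of(points, k=2)
```

Here `cost` is the sum of squared distances from each point to its assigned mean, which is the quantity `best_of` uses to decide which run produced the better clusters.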

Support Vector Machine

The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Yakovlevich Chervonenkis in 1963. It is a supervised learning algorithm. Its decision boundary is determined by the extreme cases, that is, the training points that lie closest to the other class (the support vectors). An SVM finds the frontier (a hyperplane) that best segregates two classes. Given training examples that are each marked as belonging to one of the two classes, the algorithm builds a model that determines to which class a new data point belongs. The SVM model represents the data as points in space, with the two classes separated by as wide a margin as possible. If the given data cannot be separated properly, the data is mapped to a higher-dimensional space in which a separation is possible.
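As a rough illustration of the margin idea, the sketch below fits a linear soft-margin SVM by sub-gradient descent on the hinge loss. This is a simplified stand-in, not the original 1963 formulation, and it does not perform the mapping to a higher dimension. The function names, the toy data and the values of `lam`, `lr` and `epochs` are assumptions chosen for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + hinge loss by sub-gradient descent.

    `labels` must contain +1 or -1. Returns the weights w and the bias b.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # A point violates the margin if it is misclassified or too
            # close to the separating hyperplane.
            violated = y * (dot(w, x) + b) < 1
            # The regularization term shrinks w on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            if violated:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Class (+1 or -1) given by the side of the hyperplane x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy labeled data: two well-separated classes.
samples = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0),
           (-1.0, -2.0), (2.0, 2.0), (-2.0, -2.0)]
labels = [1, -1, 1, -1, 1, -1]

w, b = train_linear_svm(samples, labels)
training_predictions = [predict(w, b, x) for x in samples]
new_predictions = [predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0))]
```

Only points that violate the margin change the model, which reflects the way the boundary depends on the examples closest to the other class.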

Since SVM is a supervised algorithm, it cannot be used without labels. So, at times, clustering algorithms are used to label the data first, and then SVM (or another supervised learning algorithm) is applied.
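A minimal sketch of this two-stage idea follows: k-means with k = 2 produces cluster indices, these are treated as class labels, and a linear SVM is then trained on them. It is only a toy illustration under stated assumptions and does not reproduce any published pipeline. The data, the fixed starting means, the cluster-to-class mapping and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def two_means(points, iterations=10):
    """Stage 1: k-means with k=2; returns a cluster index (0 or 1) per point."""
    centers = [points[0], points[-1]]  # assumption: fixed start, for reproducibility
    labels = []
    for _ in range(iterations):
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Stage 2: linear SVM fitted by sub-gradient descent on the hinge loss."""
    w, b = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            violated = y * (dot(w, x) + b) < 1
            w = [wi * (1 - lr * lam) for wi in w]
            if violated:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

unlabeled = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
             (-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0)]

clusters = two_means(unlabeled)                          # stage 1: cluster
pseudo_labels = [1 if c == 0 else -1 for c in clusters]  # cluster index -> class
w, b = train_linear_svm(unlabeled, pseudo_labels)        # stage 2: supervise

def predict(x):
    return 1 if dot(w, x) + b >= 0 else -1

training_predictions = [predict(x) for x in unlabeled]
new_predictions = [predict((3.0, 2.0)), predict((-3.0, -1.0))]
```

The quality of the second stage depends entirely on the first: if the clusters do not correspond to meaningful classes, the SVM will faithfully learn the wrong boundary.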

Comparison

Before we compare the two algorithms, it should be clear that this is not exactly an apples-to-apples comparison. Although both are machine learning algorithms, they are very different at their core: k-means is an unsupervised learning algorithm and SVM is a supervised learning algorithm.

The difference starts with the very type of data given to these algorithms. K-means is given unlabeled data, whereas SVM is given labeled data.

K-means reads the data, groups it into categories based on commonalities (closeness to a mean), and makes decisions about new data based on those commonalities. SVM operates differently: it builds its model from a training data set by drawing a hyperplane in the space that segregates the data.

K-means is fast, but its result depends on the initial means, so it can yield better results over multiple executions. SVM is slower to train but very decisive: it produces a single separating boundary.

Realization and Future Work

The best Big data applications deliver patterns or answers even before you ask for them. This means developing machine learning algorithms that recognize and bring out patterns that were not specifically asked for but are hidden deep in the data. So much data is collected every day, and it contains many hidden patterns that are yet to be found. “Predict failures in production lines: A two-stage approach with clustering and supervised learning,” by D. Zhang, B. Xu and J. Wood, may be a basic case, but if we apply unsupervised learning algorithms like k-means, or even more complex algorithms, and pass the resulting clusters through supervised algorithms, I believe many unnoticed patterns in nature, in mass behavior or in any predictive field can be found.

Conclusion

Through this survey paper we have defined what big data is, how it is different, and what its characteristics are. We have also explored the area of machine learning, studied what supervised and unsupervised learning are, and compared two different algorithms used in them.

References

  1. Shinde, Manisha. (2015). XML Object: Universal Data Structure for Big Data. International Journal of Research Trends and Development 2394-9333. 2. 107-113.
  2. Michel Adiba, Juan-Carlos Castrejon-Castillo, Javier Alfonso Espinosa Oviedo, Genoveva Vargas-Solar, José-Luis Zechinelli-Martini. Big Data Management Challenges, Approaches, Tools and their limitations. In: Shui Yu, Xiaodong Lin, Jelena Misic, and Xuemin Sherman Shen (eds.), Networking for Big Data, Chapman and Hall/CRC, 2016, 978-1-4822-6349-7. <hal-01270335>
  3. Saint John Walker (2014) Big Data: A Revolution That Will Transform How We Live, Work, and Think, International Journal of Advertising, 33:1, 181-183, DOI: 10.2501/IJA-33-1-181-183
  4. Madden, Sam. “From databases to big data.” IEEE Internet Computing 3 (2012): 4-6.
  5. Arthur, David, and Sergei Vassilvitskii. “k-means++: The advantages of careful seeding.” Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007.
  6. Unnikrishnan, Athira, Uma Narayanan, and Shelbi Joseph. “Performance analysis of various supervised algorithms on big data.” 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS). IEEE, 2017.
  7. Davenport, Thomas H., Paul Barth, and Randy Bean. “How ‘big data’ is different.” MIT Sloan Management Review, 2012.
  8. Lohr, Steve. “The age of big data.” New York Times 11.2012 (2012).
  9. McAfee, Andrew, et al. “Big data: the management revolution.” Harvard business review 90.10 (2012): 60-68.
  10. D. Zhang, B. Xu and J. Wood, “Predict failures in production lines: A two-stage approach with clustering and supervised learning,” 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, 2016, pp. 2070-2074. doi: 10.1109/BigData.2016.7840832
  11. Manyika, James, Chui, Michael, Brown, Brad, Bughin, Jacques, Dobbs, Richard, Roxburgh, Charles and Byers, Angela Hung. Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute (2011).
