Introduction To Big Data Forensics
Introduction
Big Data forensics is a new type of forensics, just as
Big Data is a new way of solving the challenges presented by large, complex
data. Thanks to the growth in data and the increased value of storing more data
and analyzing it quickly, Big Data solutions have become more common and more
prominently positioned within organizations. As such, the value of Big Data
systems has grown: they often store data used to drive organizational strategy,
identify sales opportunities, and capture many modes of electronic communication. The
forensic value of such data is obvious: if the data is useful to an
organization, then the data is valuable to an investigation of that organization.
The information in a Big Data system is not only inherently valuable; the
data is most likely organized and analyzed in a way that reveals how the
organization treated it.
Big Data forensics is the forensic collection and
analysis of Big Data systems. Traditional computer forensics typically focuses
on more common sources of data, such as mobile devices and laptops. Big Data
forensics is not a replacement for traditional forensics. Instead, Big Data
forensics augments the existing forensics
body of knowledge to handle the massive, distributed systems that
require different forensic tools and techniques.
Traditional forensic tools and methods are not always
well-suited for Big Data. The tools and techniques used in traditional
forensics are most commonly designed for the collection and analysis of
unstructured data (for example, e-mail and document files). Forensics of such
data typically hinges on metadata and involves the calculation of an MD5 or
SHA-1 checksum. With Big Data systems, the large volume of data and how the
data is stored do not lend themselves well to traditional forensics. As such,
alternative methods for collecting and analyzing such data are required.
What is Big Data?
Big data is defined as collections of data sets whose volume, velocity,
or variety is so large that it is
difficult to store, manage, process, and analyze the data using traditional databases
and data processing tools. In recent years, there has been an exponential
growth in both the structured and unstructured data generated by information
technology, industrial, healthcare, Internet of Things, and other systems.
According to an estimate by IBM, 2.5 quintillion bytes of
data are created every day. A recent report by DOMO estimates the amount of
data generated every minute on popular online platforms. Below are some key
figures from the report:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4,300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
Big Data has the potential to power the next generation of
smart applications that leverage data to add intelligence.
Applications of big data span a wide range of domains
such as the web, retail and marketing, banking and finance, industry,
healthcare, the environment, Internet of Things, and cyber-physical systems.
Big Data analytics
deals with the collection, storage, processing, and analysis of this massive-scale
data. Specialized tools and frameworks are required for big data analysis when:
• The volume of data involved is so large that it is difficult to store, process, and analyze the data on a single machine
• The velocity of the data is very high and the data needs to be analyzed in real time
• There is a variety of data involved, which can be structured, unstructured, or semi-structured, and is collected from multiple data sources
• Various types of analytics need to be performed to extract value from the data, such as descriptive, diagnostic, predictive, and prescriptive analytics
Big Data tools and frameworks have distributed and
parallel processing architectures and can leverage the storage and
computational resources of a large cluster of machines.
Big data analytics involves several steps, starting from
data cleansing, data munging (or wrangling), and data processing, through to visualization.
The big data analytics life cycle starts with the collection of data from multiple
data sources. Specialized tools and frameworks are required to ingest the data from
different sources into the big data analytics backend. The data is stored in
specialized storage solutions (such as distributed file systems and
non-relational databases) that are designed to scale. Based on the analysis
requirements (batch or real-time) and the type of analysis to be performed
(descriptive, diagnostic, predictive, or prescriptive), specialized frameworks are
used. Big data analytics is enabled by several technologies such as cloud computing,
distributed and parallel processing frameworks, non-relational databases,
and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks, including text, images, audio, and video data
• Click-stream data generated by web applications, such as e-commerce sites, to analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock market data
• Transactional data generated by banking and financial applications
Characteristics of Big Data
The underlying
characteristics of big data include:
1. Volume
Big data is data whose volume is so large that it will not fit on a single machine;
therefore, specialized tools and frameworks are required to store, process, and analyze
such data. For example, social media applications process billions of messages every
day, industrial and energy systems can generate terabytes of sensor data every
day, and cab aggregation applications can process millions of transactions in a
day. The volume of data generated by modern IT, industrial, healthcare,
Internet of Things, and other systems is growing exponentially, driven by the
lowering costs of data storage and processing architectures and by the need to
extract valuable insights from the data to improve business processes, efficiency,
and service to consumers. Though there is no fixed threshold for the volume of
data to be considered big data, the term is typically
used for massive-scale data that is difficult to store, manage, and process using
traditional databases and data processing architectures.
2. Velocity
Velocity of data refers to how fast the data is
generated. Data generated by certain sources, such as social media or sensors, can
arrive at very high velocities. Velocity is another
important characteristic of big data and a primary reason for the exponential
growth of data: a high velocity of data causes the accumulated volume
to become very large in a short span of time. Some applications have strict deadlines
for data analysis (such as trading or online fraud detection), and the data
needs to be analyzed in real time. Specialized tools are required to ingest such high-velocity data into the big data infrastructure
and analyze it in real time.
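As a minimal sketch of the kind of real-time analysis described above, the following Python class counts events in a sliding time window; production systems would use a streaming framework for this, and the timestamps below are purely illustrative:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events that fall inside the most recent window_seconds.

    A minimal illustration of analyzing a high-velocity stream in
    real time; not a substitute for a streaming framework.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of recent events

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have fallen outside the window
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def rate(self):
        """Number of events currently inside the window."""
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=60)
for t in [0, 10, 30, 55, 70, 95]:
    counter.record(t)
print(counter.rate())  # 3 events fall in the last 60 seconds as of t=95
```

The same windowing idea underlies real-time aggregations (counts, averages, anomaly flags) in stream-processing systems.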
3. Variety
Variety refers to the forms of the data. Big data comes
in different forms such as structured, unstructured, or semi-structured data, including
text, images, audio, video, and sensor data. Big data systems need to be
flexible enough to handle such a variety of data.
4. Veracity
Veracity refers to how accurate the data is. To extract
value from the data, it needs to be cleaned to remove noise. Data-driven
applications can reap the benefits of big data only when the data is meaningful
and accurate. Therefore, cleansing of data is important so that incorrect and
faulty data can be filtered out.
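A cleansing step of the kind described here can be sketched in a few lines of Python; the sensor records, field names, and validity thresholds below are illustrative assumptions, not taken from the text:

```python
# Hypothetical raw sensor readings; two are faulty.
raw_readings = [
    {"sensor": "s1", "temp": 21.5},
    {"sensor": "s2", "temp": None},   # missing value
    {"sensor": "s3", "temp": 999.0},  # implausible spike (noise)
    {"sensor": "s4", "temp": 19.8},
]

def is_valid(record, low=-40.0, high=60.0):
    """Keep only records with a present, plausible temperature."""
    temp = record.get("temp")
    return temp is not None and low <= temp <= high

cleaned = [r for r in raw_readings if is_valid(r)]
print(len(cleaned))  # 2 records survive the cleansing step
```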
5. Value
Value refers to the usefulness of the data for its intended purpose. The end goal of any
big data analytics system is to extract value from the data. The value of the
data is also related to its veracity, or accuracy. For some
applications, value also depends on how fast the data can be processed.
Big Data architecture and concepts
The architectures for Big Data solutions vary greatly,
but several core concepts are shared by most solutions. Data is collected and
ingested in Big Data solutions from a multitude of sources. Big Data solutions
are designed to handle various types and formats of data, and the various types
of data can be ingested and stored together. The data ingestion system brings
the data in for transformation before the data is sent to the storage system.
Distribution of storage is important for storing massive data sets. No
single device can store all the data, nor can it be expected to run without a
failure of the device or one of its disks. Similarly, computational
distribution is critical for performing analysis across large data sets
with timeliness requirements. Typically, Big Data solutions employ a
master/worker system, such as MapReduce, whereby one computational system acts as
the master, distributing individual analyses to the worker computational
systems to complete. The master coordinates and manages the computational tasks
and ensures that the worker systems complete
the tasks.
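The master/worker pattern can be sketched with a word count, the canonical MapReduce example; here a Python thread pool stands in for the worker machines of a real cluster, and the input chunks are illustrative:

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool stands in for a cluster

def map_task(chunk):
    """Worker task: count words in one chunk of the input."""
    return Counter(chunk.split())

def reduce_task(left, right):
    """Merge step: combine partial counts from two workers."""
    return left + right

chunks = ["big data big systems", "data forensics", "big forensics"]
with Pool(2) as pool:                      # the "worker" systems
    partials = pool.map(map_task, chunks)  # the "master" distributes tasks
totals = reduce(reduce_task, partials, Counter())
print(totals["big"])  # 3
```

In a real MapReduce deployment the chunks live on a distributed filesystem and the map and reduce tasks run on separate machines, but the division of labor is the same.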
The following figure illustrates a high-level Big Data
architecture:
Big Data solutions utilize different types of databases
to conduct the analysis. Because Big Data can include structured,
semi-structured, and/or unstructured data, the solutions need to be capable of
performing the analysis across various types of files. Big Data solutions can
utilize both relational and nonrelational database systems.
NoSQL (Not only SQL) databases are one of the primary types of nonrelational databases used in Big Data solutions. NoSQL databases use different data structures and query languages to store and retrieve information. Key-value, graph, and document structures are used by NoSQL. These types of structures can provide a better and faster method for retrieving information about unstructured, semi-structured, and structured data.
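The difference between the key-value and document structures can be sketched with plain Python dictionaries; the records and fields below are hypothetical, and real NoSQL databases add persistence, distribution, and indexing on top of these models:

```python
# Key-value model: an opaque value retrieved by exact key.
# The database does not interpret the value's contents.
key_value_store = {
    "user:1001": '{"name": "Ada", "dept": "Finance"}',
}

# Document model: records are structured documents that can be
# queried by any of their fields, not only by a primary key.
document_store = [
    {"_id": 1, "name": "Ada", "dept": "Finance"},
    {"_id": 2, "name": "Grace", "dept": "Engineering"},
]

# Key-value lookup: fast, but only by key.
blob = key_value_store["user:1001"]

# Document query: filter on a field inside the record.
finance = [d for d in document_store if d["dept"] == "Finance"]
print(len(finance))  # 1 matching document
```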
Two additional important and related concepts for many
Big Data solutions are text analytics
and machine learning. Text analytics is the analysis of unstructured sets of
textual data. This area has grown in importance with the surge in social media content and e-mail. Customer sentiment
analysis, predictive analysis of buyer
behavior, security monitoring, and economic indicator analysis are all performed
by running algorithms across text data. Text analytics is largely
made possible by machine learning. Machine learning is the use of algorithms
and tools to learn from data. Machine learning
algorithms make decisions or predictions from data inputs without the need for
explicit algorithm instructions.
Video files and other nontraditional analysis input files
can be analyzed in a couple of ways:
· Using specialized data extraction tools during data ingestion
· Using specialized techniques during analysis
In some cases, only the unstructured data's metadata is important.
In others, content from the data needs
to be captured. For example, feature extraction and object recognition
information can be captured and stored for later analysis. The needs of the Big
Data system owner dictate the types of information captured and which tools are
used to ingest, transform, and analyze the information.
Big Data forensics
The changes to the volumes of data and the advent of Big
Data systems have changed the requirements of forensics when Big Data is
involved. Traditional forensics relies on time-consuming and interruptive
processes for collecting data. Techniques central to traditional forensics
include removing hard drives from machines containing source evidence,
calculating MD5/SHA-1 checksums, and performing physical collections that
capture all metadata. However, practical limitations with Big Data systems
prevent investigators from always applying
these techniques. The differences between traditional forensics and
forensics for Big Data are covered and
explained in this section.
One goal of any type of forensic investigation is to
reliably collect relevant evidence in a defensible manner. The evidence in a
forensic investigation is the data stored in the system. This data can be the
contents of a file, metadata, deleted files, in-memory data, hard drive slack
space, and other forms. Forensic techniques are designed to capture all relevant
information. In certain cases, especially when questions about potentially
deleted information exist, the entire filesystem needs to be collected using a
physical collection of every individual bit from the source system.
In other
cases, only the informational content of a source filesystem or application
system is of value. This situation arises most commonly when only structured
data systems, such as databases, are in question, and metadata or slack space is
irrelevant or impractical to collect. Both types of collection are equally
sound; however, the choice of collection type depends on both
practical considerations and the types of evidence required for collection.
Big Data forensics is the identification, collection,
analysis, and presentation of the data in a Big Data system. The practical
challenges of Big Data systems aside, the goal is to collect data from
distributed filesystems, large-scale databases, and the associated
applications. Many similarities exist between traditional forensics and Big Data forensics, but the differences are
important to understand.
Metadata preservation
Metadata is any information about a file, data container, or application data that describes its attributes. Metadata provides information about the file that may be valuable when questions arise about how the file was created, modified, or deleted. Metadata can describe who altered a file, when a file was revised, and which system or application generated the data. These are crucial facts when trying to understand the life cycle and story of an individual file.
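As a minimal illustration, the filesystem metadata available through a POSIX-style stat call can be captured in Python. Note that this is only a subset of forensic metadata: timestamps, size, and permission bits are recorded here, while attributes such as who revised a file come from other sources (application logs, filesystem journals). The throwaway file below is created just to keep the sketch self-contained:

```python
import os
import stat
import tempfile
from datetime import datetime, timezone

# Create a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"evidence")
    path = f.name

info = os.stat(path)  # filesystem metadata for the file
print(stat.filemode(info.st_mode))                          # permission bits
print(info.st_size)                                         # size in bytes
print(datetime.fromtimestamp(info.st_mtime, timezone.utc))  # last modified

os.remove(path)
```

In a forensic collection, such attributes would be recorded before any tool touches the file, since reading or copying can itself update access timestamps.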
Metadata is not always crucial to a Big Data
investigation. Metadata is often altered or lost when data flows into and
through a Big Data system. The ingestion engines and data feeds collect the
data without preserving the metadata. The metadata would thus not provide
information about who created the data, when the data was last altered in the
upstream data source, and so on. Collecting information in these cases may not
serve a purpose. Instead, upstream information about how the data was received
can be collected as an alternative source of detail.
Investigations into Big Data systems can hinge on the
information in the data rather than the metadata. As with structured data systems,
metadata does not serve a purpose when an investigation is based solely on the
content of the data. Quantitative and qualitative questions can be answered by
the data itself; metadata in that case would not be useful, so long as the
collection was performed properly and no questions exist about who imported
and/or altered the data in the Big Data system. The data within the systems is
the only source of information.
Collection methods
Big Data systems are large, complex systems with business
requirements. As such, they may not be able to be taken offline for a forensic
investigation. In traditional forensics, systems can be taken offline, and a
collection is performed by removing the hard drive to create a forensic copy of
the data. In Big Data investigations, hundreds or thousands of storage hard
drives may be involved, and data is lost when the Big Data system is brought
offline. Also, the system may need to stay online due to business requirements.
Big Data collections usually require logical and targeted collection methods by
way of logical file forensic copies and query-based collection.
Collection verification
Traditional forensics relies on MD5 and SHA-1 to verify
the integrity of the data collected, but it is not always feasible to use
hashing algorithms to verify Big Data collections. Both MD5 and SHA-1 are
disk-access intensive. Verifying collections by computing an MD5 or SHA-1 hash
comprises a large percentage of the time dedicated to collecting and verifying
source evidence. Spending the time to calculate the MD5 and SHA-1 for a Big
Data collection may not be feasible when many terabytes of data are collected.
The alternative is to rely on control totals, collection logs, and other
descriptive information to verify the collection.
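The two verification approaches can be contrasted in a short Python sketch: an MD5 checksum over the collected bytes versus control totals (a record count and a column sum) that can be recomputed on both the source and the collected copy. The extract below is a hypothetical stand-in for a query-based collection:

```python
import csv
import hashlib
import io

# A small extract standing in for a query-based collection.
extract = "id,amount\n1,100.50\n2,200.25\n3,50.00\n"

# Traditional verification: a checksum of the collected bytes.
# Any change to the bytes changes the digest, but computing it
# requires reading every byte collected.
md5 = hashlib.md5(extract.encode()).hexdigest()

# Control totals: cheap summary figures computed on the source
# before collection and on the copy afterward, then compared.
rows = list(csv.DictReader(io.StringIO(extract)))
record_count = len(rows)
amount_total = sum(float(r["amount"]) for r in rows)
print(record_count)  # 3
print(amount_total)  # 350.75
```

A match on record counts and column sums does not prove bit-for-bit identity the way a hash does, which is why control totals are paired with collection logs and other descriptive records of how the collection was performed.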
