Introduction To Big Data Forensics
Introduction
Big Data forensics is a new type of forensics, just as
Big Data is a new way of solving the challenges presented by large, complex
data. Thanks to the growth in data and the increased value of storing more data
and analyzing it quickly, Big Data solutions have become more common and more
prominently positioned within organizations. As such, the value of Big Data
systems has grown: they often store data used to drive organizational strategy,
identify sales opportunities, and capture many modes of electronic communication. The
forensic value of such data is obvious: if the data is useful to an
organization, then the data is valuable to an investigation of that organization.
The information in a Big Data system is not only inherently valuable; the
data is most likely organized and analyzed in a way that reveals how the
organization treated it.
Big Data forensics is the forensic collection and
analysis of Big Data systems. Traditional computer forensics typically focuses
on more common sources of data, such as mobile devices and laptops. Big Data
forensics is not a replacement for traditional forensics. Instead, Big Data
forensics augments the existing forensics
body of knowledge to handle the massive, distributed systems that
require different forensic tools and techniques.
Traditional forensic tools and methods are not always
well-suited for Big Data. The tools and techniques used in traditional
forensics are most commonly designed for the collection and analysis of
unstructured data (for example, e-mail and document files). Forensics of such
data typically hinges on metadata and involves the calculation of an MD5 or
SHA-1 checksum. With Big Data systems, the large volume of data and how the
data is stored do not lend themselves well to traditional forensics. As such,
alternative methods for collecting and analyzing such data are required.
What is Big Data?
Big data is defined as collections of data sets whose volume, velocity,
or variety is so large that it is
difficult to store, manage, process, and analyze the data using traditional databases
and data processing tools. In recent years, there has been an exponential
growth in both the structured and unstructured data generated by information
technology, industrial, healthcare, Internet of Things, and other systems.
According to an estimate by IBM, 2.5 quintillion bytes of
data are created every day. A recent report by DOMO estimates the amount of
data generated every minute on popular online platforms. Below are some key
figures from the report:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4,300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
Big Data has the potential to power the next generation of
smart applications that leverage data to add intelligence.
Applications of big data span a wide range of domains
such as the web, retail and marketing, banking and finance, industry,
healthcare, the environment, Internet of Things, and cyber-physical systems.
Big Data analytics
deals with the collection, storage, processing, and analysis of this massive-scale
data. Specialized tools and frameworks are required for big data analysis when:
• The volume of data involved is so large that it is difficult to store, process, and analyze the data on a single machine
• The velocity of the data is very high and the data needs to be analyzed in real time
• There is a variety of data involved, which can be structured, unstructured, or semi-structured, and is collected from multiple data sources
• Various types of analytics need to be performed to extract value from the data, such as descriptive, diagnostic, predictive, and prescriptive analytics
Big Data tools and frameworks have distributed and
parallel processing architectures and can leverage the storage and
computational resources of a large cluster of machines.
Big data analytics involves several steps, starting from
data cleansing, data munging (or wrangling), and data processing, through to visualization.
The big data analytics life cycle starts with the collection of data from multiple
data sources. Specialized tools and frameworks are required to ingest the data from
different sources into the big data analytics backend. The data is stored in
specialized storage solutions (such as distributed file systems and
non-relational databases) that are designed to scale. Based on the analysis
requirements (batch or real-time) and the type of analysis to be performed
(descriptive, diagnostic, predictive, or prescriptive), specialized frameworks are
used. Big data analytics is enabled by several technologies such as cloud computing,
distributed and parallel processing frameworks, non-relational databases,
and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks, including text, images, audio, and video data
• Click-stream data generated by web applications, such as e-commerce sites, to analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock market data
• Transactional data generated by banking and financial applications
Characteristics of Big Data
The underlying
characteristics of big data include:
1. Volume
Big data is data whose volume is so large that it will not fit on a single machine;
therefore, specialized tools and frameworks are required to store, process, and analyze
such data. For example, social media applications process billions of messages every
day, industrial and energy systems can generate terabytes of sensor data every
day, and cab aggregation applications can process millions of transactions in a
day. The volume of data generated by modern IT, industrial, healthcare,
Internet of Things, and other systems is growing exponentially, driven by the
lowering costs of data storage and processing architectures and by the need to
extract valuable insights from the data to improve business processes, efficiency,
and service to consumers. Though there is no fixed threshold for the volume of
data to be considered big data, the term is typically
used for massive-scale data that is difficult to store, manage, and process using
traditional databases and data processing architectures.
2. Velocity
Velocity of data refers to how fast the data is
generated. Data generated by certain sources, such as social media or sensors, can
arrive at very high velocities. Velocity is another
important characteristic of big data and a primary reason for the exponential
growth of data: a high velocity of data causes the accumulated volume
to become very large in a short span of time. Some applications have strict deadlines
for data analysis (such as trading or online fraud detection), and the data
needs to be analyzed in real time. Specialized tools are required to ingest such high-velocity data into the big data infrastructure
and analyze it in real time.
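As a minimal sketch of the kind of real-time analysis described above, the following Python class counts events in a sliding time window; production systems would use a streaming framework for this, and the timestamps below are purely illustrative:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events that fall inside the most recent window_seconds.

    A minimal illustration of analyzing a high-velocity stream in
    real time; not a substitute for a streaming framework.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of recent events

    def record(self, timestamp):
        self.events.append(timestamp)
        # Drop events that have fallen outside the window
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def rate(self):
        """Number of events currently inside the window."""
        return len(self.events)

counter = SlidingWindowCounter(window_seconds=60)
for t in [0, 10, 30, 55, 70, 95]:
    counter.record(t)
print(counter.rate())  # 3 events fall in the last 60 seconds as of t=95
```

The same windowing idea underlies real-time aggregations (counts, averages, anomaly flags) in stream-processing systems.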
3. Variety
Variety refers to the forms of the data. Big data comes
in different forms such as structured, unstructured, or semi-structured data, including
text, images, audio, video, and sensor data. Big data systems need to be
flexible enough to handle such a variety of data.
4. Veracity
Veracity refers to how accurate the data is. To extract
value from the data, it needs to be cleaned to remove noise. Data-driven
applications can reap the benefits of big data only when the data is meaningful
and accurate. Therefore, cleansing of data is important so that incorrect and
faulty data can be filtered out.
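A cleansing step of the kind described here can be sketched in a few lines of Python; the sensor records, field names, and validity thresholds below are illustrative assumptions, not taken from the text:

```python
# Hypothetical raw sensor readings; two are faulty.
raw_readings = [
    {"sensor": "s1", "temp": 21.5},
    {"sensor": "s2", "temp": None},   # missing value
    {"sensor": "s3", "temp": 999.0},  # implausible spike (noise)
    {"sensor": "s4", "temp": 19.8},
]

def is_valid(record, low=-40.0, high=60.0):
    """Keep only records with a present, plausible temperature."""
    temp = record.get("temp")
    return temp is not None and low <= temp <= high

cleaned = [r for r in raw_readings if is_valid(r)]
print(len(cleaned))  # 2 records survive the cleansing step
```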
5. Value
Value refers to the usefulness of the data for its intended purpose. The end goal of any
big data analytics system is to extract value from the data. The value of the
data is also related to its veracity, or accuracy. For some
applications, value also depends on how fast the data can be processed.
Big Data architecture and concepts
The architectures for Big Data solutions vary greatly,
but several core concepts are shared by most solutions. Data is collected and
ingested in Big Data solutions from a multitude of sources. Big Data solutions
are designed to handle various types and formats of data, and the various types
of data can be ingested and stored together. The data ingestion system brings
the data in for transformation before the data is sent to the storage system.
Distribution of storage is important for storing massive data sets. No
single device can store all the data, nor can it be expected to run without a
failure of the device or one of its disks. Similarly, computational
distribution is critical for performing analysis across large data sets
with timeliness requirements. Typically, Big Data solutions employ a
master/worker system, such as MapReduce, whereby one computational system acts as
the master, distributing individual analyses to the worker computational
systems to complete. The master coordinates and manages the computational tasks
and ensures that the worker systems complete
the tasks.
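The master/worker pattern can be sketched with a word count, the canonical MapReduce example; here a Python thread pool stands in for the worker machines of a real cluster, and the input chunks are illustrative:

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool stands in for a cluster

def map_task(chunk):
    """Worker task: count words in one chunk of the input."""
    return Counter(chunk.split())

def reduce_task(left, right):
    """Merge step: combine partial counts from two workers."""
    return left + right

chunks = ["big data big systems", "data forensics", "big forensics"]
with Pool(2) as pool:                      # the "worker" systems
    partials = pool.map(map_task, chunks)  # the "master" distributes tasks
totals = reduce(reduce_task, partials, Counter())
print(totals["big"])  # 3
```

In a real MapReduce deployment the chunks live on a distributed filesystem and the map and reduce tasks run on separate machines, but the division of labor is the same.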
The following figure illustrates a high-level Big Data
architecture:
Big Data solutions utilize different types of databases
to conduct the analysis. Because Big Data can include structured,
semi-structured, and/or unstructured data, the solutions need to be capable of
performing the analysis across various types of files. Big Data solutions can
utilize both relational and nonrelational database systems.
NoSQL (Not only SQL) databases are one of the primary types of nonrelational databases used in Big Data solutions. NoSQL databases use different data structures and query languages to store and retrieve information. Key-value, graph, and document structures are used by NoSQL. These types of structures can provide a better and faster method for retrieving information about unstructured, semi-structured, and structured data.
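The difference between the key-value and document structures can be sketched with plain Python dictionaries; the records and fields below are hypothetical, and real NoSQL databases add persistence, distribution, and indexing on top of these models:

```python
# Key-value model: an opaque value retrieved by exact key.
# The database does not interpret the value's contents.
key_value_store = {
    "user:1001": '{"name": "Ada", "dept": "Finance"}',
}

# Document model: records are structured documents that can be
# queried by any of their fields, not only by a primary key.
document_store = [
    {"_id": 1, "name": "Ada", "dept": "Finance"},
    {"_id": 2, "name": "Grace", "dept": "Engineering"},
]

# Key-value lookup: fast, but only by key.
blob = key_value_store["user:1001"]

# Document query: filter on a field inside the record.
finance = [d for d in document_store if d["dept"] == "Finance"]
print(len(finance))  # 1 matching document
```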
Two additional important and related concepts for many
Big Data solutions are text analytics
and machine learning. Text analytics is the analysis of unstructured sets of
textual data. This area has grown in importance with the surge in social media content and e-mail. Customer sentiment
analysis, predictive analysis of buyer
behavior, security monitoring, and economic indicator analysis are all performed
by running algorithms across text data. Text analytics is largely
made possible by machine learning. Machine learning is the use of algorithms
and tools to learn from data. Machine learning
algorithms make decisions or predictions from data inputs without the need for
explicit algorithm instructions.
Video files and other nontraditional analysis input files
can be analyzed in a couple of ways:
· Using specialized data extraction tools during data ingestion
· Using specialized techniques during analysis
In some cases, only the unstructured data's metadata is important.
In others, content from the data needs
to be captured. For example, feature extraction and object recognition
information can be captured and stored for later analysis. The needs of the Big
Data system owner dictate the types of information captured and which tools are
used to ingest, transform, and analyze the information.
Big Data forensics
The changes to the volumes of data and the advent of Big
Data systems have changed the requirements of forensics when Big Data is
involved. Traditional forensics relies on time-consuming and interruptive
processes for collecting data. Techniques central to traditional forensics
include removing hard drives from machines containing source evidence,
calculating MD5/SHA-1 checksums, and performing physical collections that
capture all metadata. However, practical limitations with Big Data systems
prevent investigators from always applying
these techniques. The differences between traditional forensics and
forensics for Big Data are covered and
explained in this section.
One goal of any type of forensic investigation is to
reliably collect relevant evidence in a defensible manner. The evidence in a
forensic investigation is the data stored in the system. This data can be the
contents of a file, metadata, deleted files, in-memory data, hard drive slack
space, and other forms. Forensic techniques are designed to capture all relevant
information. In certain cases, especially when questions about potentially
deleted information exist, the entire filesystem needs to be collected using a
physical collection of every individual bit from the source system.
In other
cases, only the informational content of a source filesystem or application
system is of value. This situation arises most commonly when only structured
data systems, such as databases, are in question, and metadata or slack space is
irrelevant or impractical to collect. Both types of collection are equally
sound; however, the choice of collection type depends on both
practical considerations and the types of evidence required for collection.
Big Data forensics is the identification, collection,
analysis, and presentation of the data in a Big Data system. The practical
challenges of Big Data systems aside, the goal is to collect data from
distributed filesystems, large-scale databases, and the associated
applications. Many similarities exist between traditional forensics and Big Data forensics, but the differences are
important to understand.
Metadata preservation
Metadata is any information about a file, data container, or application data that describes its attributes. Metadata provides information about the file that may be valuable when questions arise about how the file was created, modified, or deleted. Metadata can describe who altered a file, when a file was revised, and which system or application generated the data. These are crucial facts when trying to understand the life cycle and story of an individual file.
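As a minimal illustration, the filesystem metadata available through a POSIX-style stat call can be captured in Python. Note that this is only a subset of forensic metadata: timestamps, size, and permission bits are recorded here, while attributes such as who revised a file come from other sources (application logs, filesystem journals). The throwaway file below is created just to keep the sketch self-contained:

```python
import os
import stat
import tempfile
from datetime import datetime, timezone

# Create a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"evidence")
    path = f.name

info = os.stat(path)  # filesystem metadata for the file
print(stat.filemode(info.st_mode))                          # permission bits
print(info.st_size)                                         # size in bytes
print(datetime.fromtimestamp(info.st_mtime, timezone.utc))  # last modified

os.remove(path)
```

In a forensic collection, such attributes would be recorded before any tool touches the file, since reading or copying can itself update access timestamps.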
Metadata is not always crucial to a Big Data
investigation. Metadata is often altered or lost when data flows into and
through a Big Data system. The ingestion engines and data feeds collect the
data without preserving the metadata. The metadata would thus not provide
information about who created the data, when the data was last altered in the
upstream data source, and so on. Collecting information in these cases may not
serve a purpose. Instead, upstream information about how the data was received
can be collected as an alternative source of detail.
Investigations into Big Data systems can hinge on the
information in the data rather than the metadata. As with structured data systems,
metadata does not serve a purpose when an investigation is based solely on the
content of the data. Quantitative and qualitative questions can be answered by
the data itself; metadata in that case would not be useful, so long as the
collection was performed properly and no questions exist about who imported
and/or altered the data in the Big Data system. The data within the systems is
the only source of information.
Collection methods
Big Data systems are large, complex systems with business
requirements. As such, they may not be able to be taken offline for a forensic
investigation. In traditional forensics, systems can be taken offline, and a
collection is performed by removing the hard drive to create a forensic copy of
the data. In Big Data investigations, hundreds or thousands of storage hard
drives may be involved, and data is lost when the Big Data system is brought
offline. Also, the system may need to stay online due to business requirements.
Big Data collections usually require logical and targeted collection methods by
way of logical file forensic copies and query-based collection.
Collection verification
Traditional forensics relies on MD5 and SHA-1 to verify
the integrity of the data collected, but it is not always feasible to use
hashing algorithms to verify Big Data collections. Both MD5 and SHA-1 are
disk-access intensive. Verifying collections by computing an MD5 or SHA-1 hash
comprises a large percentage of the time dedicated to collecting and verifying
source evidence. Spending the time to calculate the MD5 and SHA-1 for a Big
Data collection may not be feasible when many terabytes of data are collected.
The alternative is to rely on control totals, collection logs, and other
descriptive information to verify the collection.
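The two verification approaches can be contrasted in a short Python sketch: an MD5 checksum over the collected bytes versus control totals (a record count and a column sum) that can be recomputed on both the source and the collected copy. The extract below is a hypothetical stand-in for a query-based collection:

```python
import csv
import hashlib
import io

# A small extract standing in for a query-based collection.
extract = "id,amount\n1,100.50\n2,200.25\n3,50.00\n"

# Traditional verification: a checksum of the collected bytes.
# Any change to the bytes changes the digest, but computing it
# requires reading every byte collected.
md5 = hashlib.md5(extract.encode()).hexdigest()

# Control totals: cheap summary figures computed on the source
# before collection and on the copy afterward, then compared.
rows = list(csv.DictReader(io.StringIO(extract)))
record_count = len(rows)
amount_total = sum(float(r["amount"]) for r in rows)
print(record_count)  # 3
print(amount_total)  # 350.75
```

A match on record counts and column sums does not prove bit-for-bit identity the way a hash does, which is why control totals are paired with collection logs and other descriptive records of how the collection was performed.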
