[Week 6] NPTEL Big Data Computing Assignment Answers 2023



NPTEL Big Data Computing Week 6 Assignment Answers 2023

Q1. What is Distributed K-Means Iterative Clustering?
A single-node clustering algorithm
A clustering algorithm that uses distributed computing to improve scalability
A supervised machine learning algorithm
A dimensionality reduction technique

Answer:- b

Q2. What is the primary goal of the K-Means clustering algorithm?
Classification of data points
Regression analysis
Finding the nearest neighbor for each data point
Partitioning data points into clusters based on similarity

Answer:- d

Q3. What is the main objective of using Parallel K-Means with MapReduce for Big Data Analytics?
To reduce the dimensionality of the data
To classify data points into predefined categories
To efficiently cluster large datasets in a distributed manner
To perform regression analysis on big data


Answer:- c

Q4. What is the primary goal of using Parallel K-Means with MapReduce in Big Data Analytics?
To perform regression analysis
To classify data points into predefined categories
To efficiently handle large-scale clustering tasks
To visualize data patterns

Answer:- c
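
For intuition, here is a minimal pure-Python sketch of one parallel K-Means iteration in MapReduce style (illustrative only, not the course's implementation): the map step assigns each point to its nearest centroid, and the reduce step recomputes each centroid as the mean of its assigned points.

```python
from collections import defaultdict

def kmeans_map(points, centroids):
    """Map step: emit (nearest_centroid_index, point) pairs."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        yield nearest, p

def kmeans_reduce(pairs):
    """Reduce step: recompute each centroid as the mean of its assigned points."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return {idx: tuple(sum(dim) / len(pts) for dim in zip(*pts))
            for idx, pts in groups.items()}

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(kmeans_reduce(kmeans_map(points, centroids)))
# {0: (1.25, 1.5), 1: (8.5, 8.75)}
```

In a real deployment the map calls run on different workers over different partitions of the data, which is what gives the algorithm its scalability.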

Q5. Which of the following tasks can be best solved using Clustering?
Predicting the amount of rainfall based on various cues
Training a robot to solve a maze
Detecting fraudulent credit card transactions
All of the mentioned

Answer:- c

Q6. Identify the correct statement(s) in context of overfitting in decision trees:

Statement I: The idea of Pre-pruning is to stop tree induction before a fully grown tree that perfectly fits the training data is built.

Statement II: The idea of Post-pruning is to grow a tree to its maximum size and then remove the nodes using a top-bottom approach.

Only statement I is true
Only statement II is true
Both statements are true
Both statements are false

Answer:- a
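
As a concrete illustration, scikit-learn supports both ideas: pre-pruning through early stopping criteria such as max_depth, and post-pruning through minimal cost-complexity pruning (ccp_alpha). A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop tree induction early with a depth limit.
pre_pruned = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Post-pruning: grow the full tree, then prune it back
# using minimal cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```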

Q7. Identify the correct statement(s) in context of machine learning approaches:

Statement I: In supervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.

Statement II: In unsupervised approaches the target, which is what the model is predicting, is provided. This is referred to as having labeled data because the target is labeled for every sample that you have in your data set.

Only Statement I is true
Only Statement II is true
Both Statements are false
Both Statements are true

Answer:- b

Q8. What is the primary focus of Machine Learning?
Accessing data from databases
Extracting meaning from big data
Learning from data
Predicting future outcomes

Answer:- c

Q9. Which of the following is an essential activity in the Machine Learning process?
Writing code for specific tasks
Designing graphical user interfaces
Collecting and preprocessing data
Creating beautiful data visualizations

Answer:- c

Q10. Which distance measure calculates the distance along strictly horizontal and vertical paths, consisting of segments along the axes?
Euclidean distance
Manhattan distance
Cosine similarity
Minkowski distance

Answer:- b
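
For comparison, a short pure-Python sketch of the two distance measures: Euclidean distance follows the straight line between two points, while Manhattan distance sums the moves along each axis.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Distance along strictly horizontal and vertical segments."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```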


NPTEL Big Data Computing Week 5 Assignment Answers 2023

1. Where are Bloom Filters generated and used in the context of HBase?

  • Bloom Filters are generated when an HFile is persisted and stored at the end of each HFile.
  • Bloom Filters are loaded into memory during HBase operations.
  • Bloom Filters allow checking on row and column levels within the HBase store.
  • Bloom Filters are useful when data is grouped and many misses are expected during reads.
Answer :- a
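
To make the idea concrete, here is a minimal pure-Python Bloom filter sketch (HBase's actual implementation differs): it answers "definitely absent" or "possibly present", which is why it lets reads skip HFiles when many misses are expected.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means definitely absent; True means only possibly present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))  # True
print(bf.might_contain("row-99"))  # almost certainly False
```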

2. What is the primary purpose of data streaming technologies?

  • To transfer data in large, irregular chunks for batch processing.
  • To ensure that data is transferred in a lossless manner over the internet.
  • To process data as a continuous and steady stream.
  • To reduce the growth of data on the internet.
Answer :- c

3. What is the primary purpose of column families in HBase?

  • To group rows together for efficient storage.
  • To define the data type of each column.
  • To enable efficient grouping and storage of columns.
  • To encrypt columns for enhanced security.
Answer :- c

4. Which of the following statements accurately describes HBase?

  • HBase is a relational database management system (RDBMS) designed for structured data.
  • HBase is a distributed Column-oriented database built on top of the Hadoop file system.
  • HBase is a NoSQL database designed exclusively for document storage.
  • HBase is a standalone database that does not require any distributed computing framework.
Answer :- b

5. What is a “Region” in the context of HBase?

  • It refers to a single machine that holds an entire HBase table.
  • It is a small chunk of data residing in one machine, part of a cluster of machines holding one HBase table.
  • It represents the entire set of data in an HBase table.
  • It is a backup copy of an HBase table stored on a remote server.
Answer :- b

6. In HBase, __________________ is a combination of row, column family, and column qualifier, and contains a value and a timestamp.

  • Cell
  • Stores
  • HMaster
  • Region Server
Answer :- a
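
The cell's coordinates can be pictured with a plain Python mapping (purely illustrative, not HBase's storage format): the key combines row, column family, and column qualifier, and each stored value carries a timestamp so multiple versions can coexist.

```python
import time

# Illustrative model: (row_key, column_family, qualifier) -> list of (timestamp, value)
table = {}

def put(row, family, qualifier, value):
    cell_key = (row, family, qualifier)
    table.setdefault(cell_key, []).append((time.time(), value))

put("user1", "info", "email", "a@example.com")
put("user1", "info", "email", "b@example.com")  # newer version of the same cell

# The most recent timestamp wins on a default read, as in HBase.
latest = max(table[("user1", "info", "email")])
print(latest)
```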

7. HBase architecture has 3 main components:

  • Client, Column family, Region Server
  • Cell, Rowkey, Stores
  • HMaster, Region Server, Zookeeper
  • HMaster, Stores, Region Server
Answer :- c

8. What is the role of a Kafka broker in a Kafka cluster?

  • A Kafka broker manages the replication of messages between topics.
  • A Kafka broker allows consumers to fetch messages by topic, partition, and offset.
  • A Kafka broker is responsible for maintaining metadata about the Kafka cluster.
  • A Kafka broker is in charge of processing and transforming messages before they are consumed.
Answer :- b
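
A minimal sketch of this fetch model using the kafka-python client (assuming that package is installed, with a hypothetical broker at localhost:9092 and a hypothetical topic named "events"): the consumer addresses messages by topic, partition, and offset.

```python
from kafka import KafkaConsumer, TopicPartition

# Hypothetical setup: a broker on localhost:9092 serving a topic "events".
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Consumers fetch messages by topic, partition, and offset.
partition = TopicPartition("events", 0)
consumer.assign([partition])
consumer.seek(partition, 42)  # start fetching from offset 42

for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```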

9. Which of the following statements accurately describes the characteristics of batch and stream processing?

Statement 1: Batch Processing provides the ability to process and analyze data at-rest (stored data).

Statement 2: Stream Processing provides the ability to ingest, process, and analyze data in-motion in real or near-real-time.

  • Only Statement 1 is correct.
  • Only Statement 2 is correct.
  • Both Statement 1 and Statement 2 are correct.
  • Neither Statement 1 nor Statement 2 is correct.
Answer :- c

10. What is Kafka Streams primarily used for?

  • Kafka Streams is a Java library to process event streams live as they occur.
  • Kafka Streams is a SQL database for querying event data.
  • Kafka Streams is a message broker for pub-sub messaging.
  • Kafka Streams is a distributed file storage system.
Answer :- a

NPTEL Big Data Computing Week 3 Assignment Answers 2023

1. What is the primary goal of Spark when working with distributed collections?

Achieving real-time data processing
Distributing data across multiple clusters
Working with distributed collections as you would with local ones
Optimizing data storage

Answer :- c

2. Which of the following is NOT a type of machine learning algorithm available in Spark’s MLlib?

Logistic regression
Non-negative matrix factorization (NMF)
Decision tree classification
Convolutional neural network (CNN)

Answer :- b

3. Which of the following statements about Apache Cassandra is not accurate?

Cassandra is a distributed key-value store.
It was originally designed at Twitter.
Cassandra is intended to run in a datacenter and across data centers.
Netflix uses Cassandra to keep track of positions in videos.

Answer :- b

4. Which of the following statements is true about Apache Spark?

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python, and also provides an interactive shell.

Only Statement 1 is true.
Only Statement 2 is true.
Both Statement 1 and Statement 2 are true.
Neither Statement 1 nor Statement 2 is true.

Answer :- c

5. Which of the following statements is true about Resilient Distributed Datasets (RDDs) in Apache Spark?

RDDs are not fault-tolerant but are mutable.
RDDs are fault-tolerant and mutable.
RDDs are fault-tolerant and immutable.
RDDs are not fault-tolerant and not immutable.

Answer :- c
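
A minimal PySpark sketch of this property (assuming a local Spark installation): transformations never modify an RDD in place; they return a new RDD, and the recorded lineage of transformations is what allows lost partitions to be recomputed after a failure.

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

numbers = sc.parallelize([1, 2, 3, 4])   # an immutable RDD
doubled = numbers.map(lambda x: x * 2)   # a *new* RDD; `numbers` is unchanged

print(numbers.collect())  # [1, 2, 3, 4]
print(doubled.collect())  # [2, 4, 6, 8]
sc.stop()
```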

6. Which of the following is not a NoSQL database?

HBase
Cassandra
SQL Server
None of the mentioned

Answer :- c

7. How does Apache Spark’s performance compare to Hadoop MapReduce?

Apache Spark is up to 10 times faster in memory and up to 100 times faster on disk.
Apache Spark is up to 100 times faster in memory and up to 10 times faster on disk.
Apache Spark is up to 10 times faster both in memory and on disk compared to Hadoop MapReduce.
Apache Spark is up to 100 times faster both in memory and on disk compared to Hadoop MapReduce.

Answer :- b

8. ______ leverages Spark Core's fast scheduling capability to perform streaming analytics.

MLlib
Spark Streaming
GraphX
RDDs

Answer :- b

9. ________ is a distributed graph processing framework on top of Spark.

MLlib
Spark Streaming
GraphX
All of the mentioned

Answer :- c

10. Which statement is incorrect in the context of Cassandra?

It is a centralized key-value store.
It is originally designed at Facebook.
It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing.

Answer :- a

NPTEL Big Data Computing Week 2 Assignment Answers

1. What is the primary purpose of the Map phase in the MapReduce framework?

  • Combining and aggregating data.
  • Storing intermediate results.
  • Sorting and shuffling data.
  • Applying a user-defined function to each input record.
Answer :- d

2. Which of the following statements about the components in the MapReduce framework is true?
Statement 1: The Job Tracker is hosted inside the master and it receives the job execution request from the client.
Statement 2: Task Tracker is the MapReduce component on the slave machine as there are multiple slave machines.

  • Both statements are true.
  • Only statement 1 is true.
  • Only statement 2 is true.
  • Both statements are false.
Answer :- a

3. Which of the following is the slave/worker node and holds the user data in the form of Data Blocks?

  • NameNode
  • Data block
  • Replication
  • DataNode
Answer :- d

4. The number of maps in MapReduce is usually driven by the total size of ____________.

  • Inputs
  • Outputs
  • Tasks
  • None of the mentioned
Answer :- a

5. Identify the correct statement(s) in the context of YARN (Yet Another Resource Negotiator):

A. YARN is highly scalable.
B. YARN enhances a Hadoop compute cluster in many ways.
C. YARN extends the power of Hadoop to incumbent and new technologies found within the data center.

Choose the correct option:
Only statement A is correct.
Statements A and B are correct.
Statements B and C are correct.
All statements A, B, and C are correct.

Answer :- d

6. Which of the following statements accurately describe(s) the role and responsibilities of the Job Tracker in the context of Big Data computing?

A. The Job Tracker is hosted inside the master and it receives the job execution request from the client.
B. The Job Tracker breaks down big computations into smaller parts and allocates tasks to slave nodes.
C. The Job Tracker stores all the intermediate results from task execution on the master node.
D. The Job Tracker is responsible for managing the distributed file system in the cluster.

Choose the correct option:
Only statement A is correct.
Statements A and B are correct.
Statements A, B, and C are correct.
None of the statements are correct.

Answer :- b

7. Consider the pseudo-code for MapReduce’s WordCount example. Let’s now assume that you want to determine the frequency of phrases consisting of 3 words each instead of determining the frequency of single words. Which part of the (pseudo-)code do you need to adapt?

  • Only map()
  • Only reduce()
  • map() and reduce()
  • None
Answer :- a
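
Only map() needs to change because the grouping key moves from single words to 3-word phrases; reduce() still just sums the counts per key. A minimal pure-Python sketch (illustrative, not the course's pseudo-code):

```python
def map_phrases(line):
    """Emit (3-word phrase, 1) pairs instead of (word, 1) pairs."""
    words = line.split()
    for i in range(len(words) - 2):
        yield " ".join(words[i:i + 3]), 1

def reduce_counts(key, values):
    """Unchanged from WordCount: sum the counts for each key."""
    return key, sum(values)

print(list(map_phrases("big data computing with big data tools")))
```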

8. The NameNode determines that a DataNode is active using a mechanism known as:

  • Heartbeats
  • Datapulse
  • h-signal
  • Active-pulse
Answer :- a

9. Which function processes a key/value pair to generate a set of intermediate key/value pairs?

  • Map
  • Reduce
  • Both Map and Reduce
  • None of the mentioned
Answer :- a

10. Which of the following options correctly identifies the three main components of the YARN Scheduler in Hadoop?

  • Global Application Manager (GAM), Cluster Resource Tracker (CRT), Job Task Coordinator (JTC)
  • Resource Monitor (RM), Cluster Supervisor (CS), Task Executor (TE)
  • Global Resource Manager (RM), Per server Node Manager (NM), Per application (job) Application Master (AM)
  • Central Resource Coordinator (CRC), Node Resource Manager (NRM), Application Controller (AC)
Answer :- c

NPTEL Big Data Computing Week 1 Assignment Answers

1. What are the three key characteristics of Big Data, often referred to as the 3V’s, according to IBM?

  • Viscosity, Velocity, Veracity
  • Volume, Value, Variety
  • Volume, Velocity, Variety
  • Volumetric, Visceral, Vortex
Answer :- c

2. What is the primary purpose of the MapReduce programming model in processing and generating large data sets?

  • To directly process and analyze data without any intermediate steps.
  • To convert unstructured data into structured data.
  • To specify a map function for generating intermediate key/value pairs and a reduce function for merging values associated with the same key.
  • To create visualizations and graphs for large data sets.
Answer :- c

3. _____ is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

  • Flume
  • Apache Sqoop
  • Pig
  • Mahout
Answer :- a

4. What is the primary role of YARN (Yet Another Resource Negotiator) in the Apache Hadoop ecosystem?

  • YARN is a data storage layer for managing and storing large datasets in Hadoop clusters.
  • YARN is a programming model for processing and analyzing data in Hadoop clusters.
  • YARN is responsible for allocating system resources and scheduling tasks for applications in a Hadoop cluster.
  • YARN is a visualization tool for creating graphs and charts based on Hadoop data.
Answer :- c

5. Which of the following statements accurately describes the characteristics and functionality of HDFS (Hadoop Distributed File System)?

  • HDFS is a centralized file system designed for storing small files and achieving high-speed data processing.
  • HDFS is a programming language used for writing MapReduce applications within the Hadoop ecosystem.
  • HDFS is a distributed, scalable, and portable file system designed for storing large files across multiple machines, achieving reliability through replication.
  • HDFS is a visualization tool that generates graphs and charts based on data stored in the Hadoop ecosystem.
Answer :- c

6. Which statement accurately describes the role and design of HBase in the Hadoop stack?

  • HBase is a programming language used for writing complex data processing algorithms in the Hadoop ecosystem.
  • HBase is a data warehousing solution designed for batch processing of large datasets in Hadoop clusters.
  • HBase is a key-value store that provides fast random access to substantial datasets, making it suitable for applications requiring such access patterns.
  • HBase is a visualization tool that generates charts and graphs based on data stored in Hadoop clusters.
Answer :- c

7. ______ brings scalable parallel database technology to Hadoop, allowing users to submit low-latency queries to data stored in HDFS or HBase without requiring a lot of data movement and manipulation.

  • Apache Sqoop
  • Mahout
  • Flume
  • Impala
Answer :- d

8. What is the primary purpose of ZooKeeper in a distributed system?

  • ZooKeeper is a data warehousing solution for storing and managing large datasets in a distributed cluster.
  • ZooKeeper is a programming language for developing distributed applications in a cloud environment.
  • ZooKeeper is a highly reliable distributed coordination kernel used for tasks such as distributed locking, configuration management, leadership election, and work queues.
  • ZooKeeper is a visualization tool for creating graphs and charts based on data stored in distributed systems.
Answer :- c

9. ____ is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the entire cluster.

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop YARN
  • Hadoop MapReduce
Answer :- b

10. Which statement accurately describes Spark MLlib?

  • Spark MLlib is a visualization tool for creating charts and graphs based on data processed in Spark clusters.
  • Spark MLlib is a programming language used for writing Spark applications in a distributed environment.
  • Spark MLlib is a distributed machine learning framework built on top of Spark Core, providing scalable machine learning algorithms and utilities for tasks such as classification, regression, clustering, and collaborative filtering.
  • Spark MLlib is a data warehousing solution for storing and querying large datasets in a Spark cluster.
Answer :- c
