Big Data On Demand with MongoDB
MongoDB, with its scale-out architecture, can be a great fit for applications that need to handle large volumes of data, or irregular spurts of it, without a large performance overhead.
In this blog, we discuss some of the problems found in applications processing large volumes of data, explain how MongoDB can solve those problems, and then walk through implementing MongoDB for a hypothetical use case.
- Problems With High-Volume Data
- MongoDB Concepts
- Implementing MongoDB
- Final Thoughts
- Additional Resources
Problems With High-Volume Data
If you have a high-volume database server in a single data center serving multiple regions with a high throughput requirement, you might discover the database is not as performant as expected. Users are complaining of long wait times when accessing data, especially users furthest away from the data center. In addition, you are struggling to keep the most recently used data in memory because the database server keeps evicting pages. What if we could bring important data closer to the user and keep infrequently used data further away, while concealing where this data comes from and keeping memory within a manageable threshold?
A Hypothetical Example
Imagine a fictitious use case of tracking the location of 18-wheel trucks for a freight shipping company, Brutus Trucking. The company owns a fleet of 5,000 18-wheelers scattered throughout the United States, and they plan on aggressive growth by adding another 5,000 trucks over the next five years.
The National Highway Traffic Safety Administration (NHTSA) has mandated that each vehicle's speed and location are reported every six seconds along with fuel and oil levels, autopilot setting, tire pressure readings (of each tire), and dashboard alarms. An Event Data Recorder (EDR) will be used to capture this data. This data will be used for an early warning system to reduce the number of accidents across the country. All of this data is fed into a database for analytics, and this data must be retained for a period of three years.
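To make this concrete, a single EDR reading might be stored as a document like the one below. This is only a sketch; the field names are hypothetical, and we define the real collection and shard key later in this article.

// A hypothetical EDR checkpoint document for one six-second reading
{
    containerKey: 150,                                // region-based routing value (the shard key, explained later)
    truckId: "BRUTUS-00417",                          // hypothetical fleet identifier
    recordedAt: ISODate("2019-01-28T16:09:39Z"),
    location: { type: "Point", coordinates: [-122.33, 47.61] },  // GeoJSON point: Seattle
    speedMph: 62.5,
    fuelLevelPct: 71,
    oilLevelPct: 88,
    autopilot: false,
    tirePressurePsi: [105, 104, 102, 101, 106, 103],  // one entry per tire (truncated here)
    dashboardAlarms: []
}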
There are a number of problems here that need to be addressed:
- Funneling all of this data into one standalone server is obviously going to create a bottleneck. The server as well as the network will be challenged in this scenario.
- Redundancy: Loss of one server creates a single point of failure. Disaster recovery does not exist.
- A single server's memory cannot hold an ever-growing working set; frequently used data will be evicted.
- Trucks and analytics applications located furthest from the data center incur a significant latency penalty.
A Hypothetical Solution
What if you could partition this data into two or more regions, East and West, and provide redundancy? The data for each truck would be fed into the nearest data center. As each piece of data is received, it would be tagged to indicate which region or data center it belongs to. For example, an 18-wheeler located in Seattle would send data to the West data center, and that data would subsequently be replicated to the East data center. An analytics application in Miami could still run a report on trucks traveling the West coast using the database server in the East without querying the data center in the West, and vice versa. This allows devices to write to the closest server while the data can be read from anywhere.
How do we build this? The answer is a sharded cluster with MongoDB. Sharding allows us to split the data and load across multiple servers, and we can add servers as the fleet expands. In addition, data from each data center is replicated to the others to provide redundancy.
Let’s show how we can accomplish this.
MongoDB Concepts
A few concepts first:
- Server: A server can contain multiple mongod processes belonging to different replica sets. The terms server and mongod are used interchangeably.
- Replica set: A collection of servers running mongod processes that contain copies of identical data. A replica set is preferably deployed on servers in multiple data centers for redundancy and disaster recovery. Typically, a replica set contains an odd number of servers (usually three), but it can contain more or fewer depending on the configuration. For this example, our replica sets will have three mongod processes.
- Primary: The one server of a replica set that accepts inserts and updates of documents. The rest of the servers are designated as secondaries.
- Secondary: A read-only server belonging to a replica set. Secondaries can be promoted to primary automatically; this can be prevented by a configuration change (discussed later).
- Shard: A subset of data that can optionally be replicated across servers. For instance, a specific field called a shard key in a document is tagged either East or West depending on which server it arrived on. A collection of documents that are all tagged East is a shard.
- Query router: A process that applications connect to for retrieving and updating data. Applications are completely unaware of which server(s) the data came from. Query routers run as a mongos process, preferably on the same machine as the application querying the database (e.g. an application server). A connection sketch follows this list.
- Configuration database: To determine where specific data resides, the query routers reference a configuration database containing metadata. This database is kept up to date by a mongod process and is deployed as a replica set.
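As a quick sketch of what "applications connect to a query router" looks like in practice, here is how the mongo shell can open a connection to a mongos and issue a query. The host, port, database, and collection names match the environment we build below; treat the query itself as illustrative.

// Connect to a mongos query router, not directly to a shard's mongod
var conn = new Mongo("mongodb-config-east-prod1:27018");
var db = conn.getDB("TRUCKS");

// The application neither knows nor cares which shard holds this data
db.CHECKPOINTS.find({ containerKey: 150 });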
Implementing MongoDB For Big Data
We will install and configure 12 servers in three data centers using MongoDB 4.0 Community Edition on CentOS 7. You might ask “Where am I going to get 12 servers?” I used Vagrant (version 2.2.3) which is an open source virtual software environment allowing you to spawn multiple instances of CentOS. It is available for a number of operating systems and you can download the Vagrant files used in this article.
If you are using git, then you can clone the repository:
git clone https://github.com/wcrowell/mongodb-vagrant-demo.git
The instructions note when you can skip a section if you are using Vagrant.
A few important notes:
- It is advised to use logical names instead of the actual server names in case a server needs to be replaced.
- I would advise keeping ports 27017 (default Mongo port) and 27018 open between data centers.
Here is a list of the machines created (with Vagrant):
- mongodb-config-east-prod1:
mongod
process running the configuration database and the query router (mongos
). - mongodb-config-central-prod1:
mongod
process running the configuration database and the query router (mongos
). - mongodb-config-west-prod1:
mongod
process running the configuration database and the query router (mongos
). - mongodb-east-prod1: The primary
mongod
process containing shard East. This instance will be writable. - mongodb-east-prod2: The secondary
mongod
process containing shard Central. This instance will be read-only. - mongodb-east-prod3: The secondary
mongod
process containing shard West. This instance will be read-only. - mongodb-central-prod1: The primary
mongod
process containing shard Central. - mongodb-central-prod2: The secondary
mongod
process containing shard West. This instance will be read-only. - mongodb-central-prod3: The secondary
mongod
process containing shard East. This instance will be read-only. - mongodb-west-prod1: The primary
mongod
process containing shard West. - mongodb-west-prod2: The secondary
mongod
process containing shard East. This instance will be read-only. - mongodb-west-prod3: The secondary
mongod
process containing shard Central. This instance will be read-only.
The completed environment should look like the following diagram:
Note: Do all operating system instructions as root. MongoDB-specific commands can be run as the user MongoDB runs under (e.g. mongod).
Pre-Installation Setup
Skip to the section Initiate the Configuration Database Replica Set if you decide to use the Vagrantfile, since the Vagrantfile copies these entries into /etc/hosts for you. Otherwise, add the logical names of each server to DNS or append them to the /etc/hosts file:
sudo vi /etc/hosts
10.10.10.60 mongodb-config-east-prod1
10.10.10.61 mongodb-config-central-prod1
10.10.10.62 mongodb-config-west-prod1
10.10.10.63 mongodb-east-prod1
10.10.10.64 mongodb-east-prod2
10.10.10.65 mongodb-east-prod3
10.10.10.66 mongodb-central-prod1
10.10.10.67 mongodb-central-prod2
10.10.10.68 mongodb-central-prod3
10.10.10.69 mongodb-west-prod1
10.10.10.70 mongodb-west-prod2
10.10.10.71 mongodb-west-prod3
Note: The IP addresses/subnets can be changed according to your environment.

Install MongoDB 4.0. Note: Refer to the documentation for your specific operating system.

Add the MongoDB 4.0 repository:

sudo vi /etc/yum.repos.d/mongodb-org-4.0.repo
Add the following content:
[mongodb-org-4.0]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/4.0/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-4.0.asc
Then install the MongoDB packages:

sudo yum install -y mongodb-org
Configuration Database Servers
In this step, we will be configuring the query routers and configuration database servers:
The configuration servers are a three-member replica set that contains information on where data resides. The query routers connect to the configuration servers to query this metadata.

If you are using Vagrant, go into each of the configuration server directories (mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1) and issue the following command:

vagrant up
You can skip down to Initiate the Configuration Database Replica Set if you are using the Vagrantfiles.

1) On the configuration servers, mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1, change the /etc/mongod.conf file to the following, replacing the bindIp value with the corresponding name of the machine (e.g. mongodb-config-east-prod1):
systemLog:
destination: file
logAppend: true
path: /var/log/mongodb/mongod.log
storage:
dbPath: /var/lib/mongo
processManagement:
fork: true
pidFilePath: /var/run/mongodb/mongod.pid
net:
port: 27017
  bindIp: mongodb-config-<region>-prod1
sharding:
clusterRole: configsvr
Where <region> combined with the machine name should match mongodb-config-east-prod1, mongodb-config-central-prod1, or mongodb-config-west-prod1.

Note: We are going to use the logical machine name here instead of an IP address. This keeps the configuration stable if a machine goes down and has to be replaced. You are more than welcome to use an IP address here.

Notice the sharding.clusterRole setting of "configsvr". This tells MongoDB this instance will be a configuration server.

Do not start the configuration servers just yet. Next, we need to tell MongoDB the configuration servers are part of a replica set by changing the systemd file. Open /usr/lib/systemd/system/mongod.service. At the end of the Environment value, add "--replSet c1":

Environment="OPTIONS=-f /etc/mongod.conf --replSet c1"
Setting --replSet to c1 means each configuration database server will be a member of the c1 replica set. You can use another name for the replica set as long as it matches on the other configuration database servers. You can also set the user and group the mongod process runs under:
User=mongod
Group=mongod
If you do change the User and Group, then make sure to change the ExecStartPre value as well:

ExecStartPre=/usr/bin/chown mongod:mongod /var/run/mongodb
After making the changes to /usr/lib/systemd/system/mongod.service, refresh the service:

sudo systemctl daemon-reload
Initiate the Configuration Database Replica Set
Start the configuration database servers (mongod) on mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1:

sudo service mongod start

From any of the mongodb-config-*-prod1 servers, connect to the admin database:

mongo mongodb-config-east-prod1:27017/admin
Initiate the cluster from the JavaScript command prompt:
rs.initiate(
{
_id: "c1",
configsvr: true,
members: [
{ _id : 0, host: "mongodb-config-east-prod1:27017" },
{ _id : 1, host: "mongodb-config-central-prod1:27017" },
{ _id : 2, host: "mongodb-config-west-prod1:27017" }
]
}
)
After running the command, the JSON returned should have the following if the command was successful:
…
"ok" : 1
…
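Before exiting, you can also confirm each member's state directly from the same prompt. A minimal check (using the standard rs.status() helper) looks like this:

// Print each replica set member and its current state (PRIMARY, SECONDARY, STARTUP2, ...)
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr);
});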
Exit from the JavaScript command prompt and open /var/log/mongodb/mongod.log. You should see the following statements indicating connectivity to the other members in the replica set:
…
2019-01-28T16:09:39.501+0000 I ASIO [Replication] Connecting to mongodb-config-central-prod1:27017
2019-01-28T16:09:39.520+0000 I REPL [replexec-26] Member mongodb-config-central-prod1:27017 is now in state STARTUP2
2019-01-28T16:09:41.526+0000 I REPL [replexec-26] Member mongodb-config-central-prod1:27017 is now in state SECONDARY
…
2019-01-28T16:09:46.598+0000 I ASIO [Replication] Connecting to mongodb-config-west-prod1:27017
2019-01-28T16:09:46.618+0000 I REPL [replexec-25] Member mongodb-config-west-prod1:27017 is now in state STARTUP2
2019-01-28T16:09:58.144+0000 I REPL [replexec-21] Member mongodb-config-west-prod1:27017 is now in state SECONDARY
…
This ends the section of configuration changes to the config servers. Next, we will configure the query routers that will connect to the config servers.
Query Routers
In this step, we will configure the query routers on the same machines where the configuration databases reside.

Note: Normally, the query routers would be placed on an application server, and the application server's MongoDB driver would be configured to connect to the query router. The goal is to locate the query routers as close to the querying application as possible.

You can skip the following section if you are using the Vagrant files. Continue to Database Servers.

1) On the configuration servers, mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1, create an /etc/mongos.conf file, replacing the bindIp value with the corresponding name of the machine (e.g. mongodb-config-east-prod1):
systemLog:
destination: file
path: "/var/log/mongodb/mongos.log"
logAppend: true
sharding:
configDB: c1/mongodb-config-east-prod1:27017,mongodb-config-central-prod1:27017,mongodb-config-west-prod1:27017
net:
  bindIp: mongodb-config-<region>-prod1
port: 27018
Where <region> combined with the machine name should match mongodb-config-east-prod1, mongodb-config-central-prod1, or mongodb-config-west-prod1.

Create the following file: /usr/lib/systemd/system/mongos.service
[Unit]
Description=High-performance, schema-free document-oriented database
After=syslog.target
After=network.target
[Service]
User=mongod
Group=mongod
Type=forking
RuntimeDirectory=mongodb
RuntimeDirectoryMode=755
PIDFile=/var/run/mongodb/mongos.pid
ExecStart=/usr/bin/mongos --quiet \
--config /etc/mongos.conf \
--pidfilepath /var/run/mongodb/mongos.pid \
--fork
LimitFSIZE=infinity
LimitCPU=infinity
LimitAS=infinity
LimitNOFILE=64000
LimitNPROC=64000
[Install]
WantedBy=multi-user.target
Run the following command to enable the service:

sudo systemctl enable mongos.service
Database Servers
East Shard
We will set up the replica set for the East Shard:
If you are using Vagrant, go into each of the East Shard server directories (mongodb-east-prod1, mongodb-central-prod3, and mongodb-west-prod2) and issue the following command:

vagrant up

You can skip down to Initiate the East Shard Database Replica Set if you are using the Vagrantfiles.

1) On the mongodb-east-prod1, mongodb-central-prod3, and mongodb-west-prod2 servers, insert the sharding and replication sections into the /etc/mongod.conf file:
systemLog:
destination: file
logAppend: true
path: /var/log/mongodb/mongod.log
storage:
dbPath: /var/lib/mongo
processManagement:
fork: true
pidFilePath: /var/run/mongodb/mongod.pid
sharding:
clusterRole: shardsvr
replication:
oplogSizeMB: 10240
replSetName: East
net:
port: 27017
  bindIp: mongodb-<region>-prod<X>
Where <region> and <X> combined with the machine name should match mongodb-east-prod1, mongodb-central-prod3, or mongodb-west-prod2.
Initiate the East Shard Database Replica Set
Start up the mongod service:

sudo service mongod start
2) From any of the mongodb-config-*-prod1 servers:
mongo mongodb-east-prod1:27017/admin
rs.initiate()
rs.add("mongodb-central-prod3:27017")
rs.add("mongodb-west-prod2:27017")
cfg = rs.config()
cfg.members[2].priority = 0
rs.reconfig(cfg)
The following statement prevents a secondary (in another data center) from becoming a primary:

cfg.members[2].priority = 0

Reference: https://docs.mongodb.com/manual/tutorial/configure-secondary-only-replica-set-member/#assign-priority-value-of-0
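To verify the reconfiguration took effect, you can print each member's priority from the same prompt; the third member should report priority 0:

// Members with priority 0 can never be elected primary
rs.conf().members.forEach(function (m) {
    print(m.host + " priority=" + m.priority);
});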
Central Shard
We will set up the replica set for the Central Shard:
If you are using Vagrant, go into each of the Central Shard server directories (mongodb-central-prod1, mongodb-west-prod3, and mongodb-east-prod2) and issue the following command:

vagrant up

You can skip down to Initiate the Central Shard Database Replica Set if you are using the Vagrantfiles.

1) On mongodb-central-prod1, mongodb-west-prod3, and mongodb-east-prod2, insert the sharding and replication sections into the /etc/mongod.conf file:
systemLog:
destination: file
logAppend: true
path: /var/log/mongodb/mongod.log
storage:
dbPath: /var/lib/mongo
processManagement:
fork: true
pidFilePath: /var/run/mongodb/mongod.pid
sharding:
clusterRole: shardsvr
replication:
oplogSizeMB: 10240
replSetName: Central
net:
port: 27017
  bindIp: mongodb-<region>-prod<X>
Where <region> and <X> combined with the machine name should match mongodb-central-prod1, mongodb-west-prod3, or mongodb-east-prod2.
Initiate the Central Shard Database Replica Set
Start up the mongod service:

sudo service mongod start
2) From any of the mongodb-config-*-prod1 servers:
mongo mongodb-central-prod1:27017/admin
rs.initiate()
rs.add("mongodb-west-prod3:27017")
rs.add("mongodb-east-prod2:27017")
cfg = rs.config()
cfg.members[2].priority = 0
rs.reconfig(cfg)
West Shard
We will set up the replica set for the last shard, the West:
If you are using Vagrant, go into each of the West Shard server directories (mongodb-west-prod1, mongodb-east-prod3, and mongodb-central-prod2) and issue the following command:

vagrant up

You can skip down to Initiate the West Shard Database Replica Set if you are using the Vagrantfiles.

1) On mongodb-west-prod1, mongodb-east-prod3, and mongodb-central-prod2, insert the sharding and replication sections into the /etc/mongod.conf file:
systemLog:
destination: file
logAppend: true
path: /var/log/mongodb/mongod.log
storage:
dbPath: /var/lib/mongo
processManagement:
fork: true
pidFilePath: /var/run/mongodb/mongod.pid
sharding:
clusterRole: shardsvr
replication:
oplogSizeMB: 10240
replSetName: West
net:
port: 27017
  bindIp: mongodb-<region>-prod<X>
Where <region> and <X> combined with the machine name should match mongodb-west-prod1, mongodb-east-prod3, or mongodb-central-prod2.
Initiate the West Shard Database Replica Set
Start up the mongod service:

sudo service mongod start
2) From any of the mongodb-config-*-prod1 servers:
mongo mongodb-west-prod1:27017/admin
rs.initiate()
rs.add("mongodb-east-prod3:27017")
rs.add("mongodb-central-prod2:27017")
cfg = rs.config()
cfg.members[2].priority = 0
rs.reconfig(cfg)
Shards
Start the query routers (mongos) on mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1:

sudo service mongos start

From mongodb-config-east-prod1, connect to the query router (mongos):

mongo mongodb-config-east-prod1:27018/admin

Note: Make sure you are connecting to the query router on port 27018 and not 27017, or you will receive the error: "no such command: 'addShard'"

Add each shard to the cluster:
db.runCommand({addShard:"East/mongodb-east-prod1:27017", name: "East"})
db.runCommand({addShard:"Central/mongodb-central-prod1:27017", name: "Central"})
db.runCommand({addShard:"West/mongodb-west-prod1:27017", name: "West"})
We will create a database named TRUCKS and indexes on a collection called CHECKPOINTS:
use TRUCKS
db.CHECKPOINTS.createIndex(
{ "containerKey" : 1,
"_id" : 1
},
{"name" : "containerKey_1__id_1", background : true}
);
db.CHECKPOINTS.createIndex(
{ "containerKey" : 1,
"eventId" : 1
},
{"name" : "containerKey_1_eventId_1", background : true}
);
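// Note: Both indexes are prefixed with containerKey, the field we shard on below,
// so queries that include it can be targeted to a single shard. For example, this
// lookup (the eventId value is made up) is served by the containerKey_1_eventId_1 index:
db.CHECKPOINTS.find({ containerKey: 150, eventId: 42 })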
use admin
sh.addShardTag("East", "East")
sh.addShardTag("Central", "Central")
sh.addShardTag("West", "West")
sh.enableSharding("TRUCKS")
use TRUCKS
sh.shardCollection("TRUCKS.CHECKPOINTS", { "containerKey" : 1 })
sh.addTagRange("TRUCKS.CHECKPOINTS", { "containerKey": 100 }, { "containerKey" : 200 }, "East")
sh.addTagRange("TRUCKS.CHECKPOINTS", { "containerKey": 200 }, { "containerKey" : 300 }, "Central")
sh.addTagRange("TRUCKS.CHECKPOINTS", { "containerKey": 300 }, { "containerKey" : 400 }, "West")
Exit out of the JavaScript prompt.
Loading Test Data
Download the checkpoints.json file, then import it through the query router:

mongoimport --host mongodb-config-east-prod1 --port 27018 --db TRUCKS --collection CHECKPOINTS --file /common/checkpoints.json
Get back to a JavaScript prompt on the query router:

mongo mongodb-config-east-prod1:27018/admin
Verify the shard distribution from the mongo> prompt:

use TRUCKS
db.CHECKPOINTS.getShardDistribution()
Since we inserted four documents for each shard, you should see a total of 12 documents distributed evenly across the cluster:
Shard East at East/mongodb-central-prod3:27017,mongodb-east-prod1:27017,mongodb-west-prod2:27017
data : 492B docs : 4 chunks : 1
estimated data per chunk : 492B
estimated docs per chunk : 4
Shard West at West/mongodb-central-prod2:27017,mongodb-east-prod3:27017,mongodb-west-prod1:27017
data : 516B docs : 4 chunks : 2
estimated data per chunk : 258B
estimated docs per chunk : 2
Shard Central at Central/mongodb-central-prod1:27017,mongodb-east-prod2:27017,mongodb-west-prod3:27017
data : 492B docs : 4 chunks : 2
estimated data per chunk : 246B
estimated docs per chunk : 2
Totals
data : 1KiB docs : 12 chunks : 5
Shard East contains 32.8% data, 33.33% docs in cluster, avg obj size on shard : 123B
Shard West contains 34.39% data, 33.33% docs in cluster, avg obj size on shard : 129B
Shard Central contains 32.8% data, 33.33% docs in cluster, avg obj size on shard : 123B
You can verify this by connecting to mongodb-east-prod1 and confirming only four documents are present:

mongo mongodb-east-prod1:27017/TRUCKS

Note: If mongodb-east-prod1 is not a primary (e.g. you are getting a NotMasterNoSlaveOk error), then you first must allow the shell to read data by doing the following:

db.setSlaveOk()

Then get the documents from the CHECKPOINTS collection:

db.CHECKPOINTS.find()

This should return the four documents in the East CHECKPOINTS sharded collection.
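As a final check that queries are targeted rather than scattered, you can ask the query router for its plan. Back on the mongos prompt (port 27018), an explain on a shard-key query should name only one shard; with a containerKey of 150, that should be East:

// The winning plan's shard list should contain only the East shard
db.getSiblingDB("TRUCKS").CHECKPOINTS.find({ containerKey: 150 }).explain()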
Starting the Cluster
I would start the mongod configuration database servers first, then the database servers, and finally the query routers.

Start the configuration database servers by issuing sudo service mongod start, in order, on mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1.

Start the database servers by issuing sudo service mongod start, in order, on:

- East Shard: mongodb-east-prod1 (primary), mongodb-central-prod3 (secondary), and mongodb-west-prod2 (secondary)
- Central Shard: mongodb-central-prod1 (primary), mongodb-west-prod3 (secondary), and mongodb-east-prod2 (secondary)
- West Shard: mongodb-west-prod1 (primary), mongodb-east-prod3 (secondary), and mongodb-central-prod2 (secondary)

Finally, start the query routers (mongos) on mongodb-config-east-prod1, mongodb-config-central-prod1, and mongodb-config-west-prod1:

sudo service mongos start
Final Thoughts
For large-scale applications handling a lot of data, MongoDB's sharding and replication can elegantly scale data volume and throughput without the single-server bottlenecks described earlier.
It's also important to note that MongoDB isn't a silver bullet for all data needs within applications. Although MongoDB 4.0 introduced multi-document ACID transactions on replica sets, applications that lean heavily on transactions and relational joins may still be better served by a relational database.
Additional Resources
Looking for additional reading regarding open source databases like MongoDB? Here are a few of our favorites:
- Blog - RDBMS vs. NoSQL
- Blog - Cassandra vs. MongoDB
- Blog - Top 3 Open Source Databases
- Support - OpenLogic Supported Open Source Databases
- Resource Collection - Intro to Open Source Databases
Get Support and Guidance for Your Open Source
Need support for your MongoDB implementation? Our OpenLogic open source support team is available 24×7 to assist you with this and other open source software!