Sometimes you want completely free form documents and sometimes you don’t. In the 3.2 release of MongoDB the idea of
document validation was introduced. One area where I have encountered problems in the past is dates being inserted
using different data types.
Example (command line notation used for brevity): the first team was using numeric dates, the second team was using
string format, and the third team was using ISO dates.
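A minimal reconstruction of those three insert styles, using a collection and field name of my own:

```
// Team one: a numeric (epoch millisecond) value
db.dates.insert({ "date" : 1420070400000 })

// Team two: a string
db.dates.insert({ "date" : "2015-01-01" })

// Team three: an ISO date
db.dates.insert({ "date" : new Date("2015-01-01") })
```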
The communication issues that led to this problem are a story for another time, but needless to say a fair amount of
data cleanup was required to fix the issue, as the primary application that read the data was expecting an ISO date.
The code issue was solved by using a shared library that defined the database mapping object. Now let’s take a
look at how document validation could be used to solve this problem.
We will create a collection with the following command:
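A sketch of what that command might look like, assuming a collection named dates and the $type string aliases
introduced in 3.2:

```
db.createCollection("dates", {
  validator: { date: { $type: "date" } }
})
```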
With the collection created with this validator, one can now only insert documents whose date uses the new Date() format:
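For example (the value itself is arbitrary):

```
db.dates.insert({ "date" : new Date("2015-06-01") })
```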
The numeric and string inserts will give the following error:
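From the 3.2 shell the rejected inserts come back with something along these lines:

```
> db.dates.insert({ "date" : 1420070400000 })
WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 121,
    "errmsg" : "Document failed validation"
  }
})
```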
In my mind this error message leaves something to be desired. We do not know why we failed the validation, just
that the document failed.
There is another interesting side effect of the preceding validation: all documents in the dates collection are now
required to contain the date field. What if, however, not all of your documents contain a date field? We
can get around this problem with the following validator:
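One way to express that, again assuming the dates collection from above:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $exists: false } },
      { date: { $type: "date" } }
    ]
  }
})
```

The same validator document can also be attached to an existing collection with the collMod command.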
The date field is now optional, but if it exists it must be of type date.
Let's make it a bit more interesting and add the requirement that if the date exists it must be after
January 1st, 2015:
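A sketch of that validator, with the cut-off date written as a literal:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $exists: false } },
      { date: { $type: "date", $gt: new Date("2015-01-01") } }
    ]
  }
})
```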
By using the $or operator mixed with the implicit "$and" operator we can now logically divide the validation
requirements.
Some other interesting cases.
Integers vs floating-point values
Almost every developer new to MongoDB will make this "mistake" on the command line:
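Something like the following, using a throwaway numbers collection:

```
db.numbers.insert({ "value" : 2 })
// the shell treats every bare number as a 64-bit floating-point double
```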
One would think that they have inserted a document containing the integer number two into the collection, when
in fact the document now contains a
floating-point value.
I can attest to the pain this can cause (especially if you are using a strongly typed language such as Java).
We can use the validator to safeguard against this:
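For example:

```
db.createCollection("numbers", {
  validator: { value: { $type: "int" } }
})
```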
The preceding insert will now fail with the 121 error code; to store an actual integer you need to use the NumberInt constructor:
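For example:

```
db.numbers.insert({ "value" : NumberInt(2) })
```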
Oh noes, my collection can't haz documents!!!
At the time of writing it is quite possible to mess up your validation and make the collection non-writeable.
Consider the following example:
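A sketch of such a self-contradictory validator:

```
db.createCollection("dates", {
  validator: {
    $and: [
      { date: { $type: "int" } },
      { date: { $type: "date" } }
    ]
  }
})
```

No document can satisfy both conditions at once, so every insert is rejected.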
We have made the dates collection require a date field that is both an integer and ISO date type. So who validates
the validator? At this point it appears to be you at definition time. In future releases this may be something that
MongoDB Compass addresses.
Perhaps, however, all we really wanted to do was say that the date field can be either an integer or an ISO date.
In that case we can solve the problem with the following:
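For example:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $type: "int" } },
      { date: { $type: "date" } }
    ]
  }
})
```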
Almost referential integrity?
MongoDB 3.2 also introduces some new aggregation operators. One of the more interesting ones is $lookup. This
allows you to run aggregations across multiple collections. However, you do not appear to be able to hack it
into a foreign key constraint for validation (yet).
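As a quick illustration of the operator itself (the collection and field names here are hypothetical), $lookup
performs a left outer join from one collection into another:

```
db.orders.aggregate([
  { $lookup: {
      from: "customers",          // collection to join against
      localField: "customerId",   // field in the orders documents
      foreignField: "_id",        // field in the customers documents
      as: "customer"              // array field added to each output document
  } }
])
```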
But what about performance?
I was expecting to see some slowdowns on loads when using validation, so I set up some load tests using
Apache JMeter. JMeter now comes with nice built-in MongoDB testing tools,
and I was all excited about posting pretty graphs showing how much validation slowed down the loading process.
However, there was next to no cost for using it! As such, I decided not to include these graphs but instead simply
recommend that you profile your own application as necessary.
So I have been pondering various ways to import data into MongoDB lately. While it is possible to import .json files or
.csv files directly, they do not carry the type data with them. Mongo's default behavior in this case appears to be to
store the data in the "least expensive" format. Thus fields that are intended to be longs may be stored as integers and
dates will be stored as strings, etc. If we are using a strongly typed language this can lead to issues when we retrieve
the data back out and it is not in a type that we expect.
So what are our alternatives?
Many legacy systems may send a .csv file with something like:
"Bob","Pants","07/14/1986"
We would then need a file descriptor of some sort to interpret the file, historically in XML:
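For instance, a hypothetical descriptor might look like this:

```
<fileDescriptor>
  <field position="1" name="firstName" type="string"/>
  <field position="2" name="lastName"  type="string"/>
  <field position="3" name="birthDate" type="date" format="MM/dd/yyyy"/>
</fileDescriptor>
```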
Or, perhaps we can try to use the file header to carry the information:
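For example, a made-up header convention that tags each column with a type:

```
firstName:string,lastName:string,birthDate:date
"Bob","Pants","07/14/1986"
```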
The problem with either approach is that we must update the header or the metadata file every time the incoming
data changes, and we still need custom code to ingest the file and load it into MongoDB.
Using JSON files that carry their type information with them.
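One way to do this is MongoDB's Extended JSON style of type markers; a document for our sample row might look
something like the following (the accountNumber field is invented here just to show a 64-bit long):

```
{
  "firstName" : "Bob",
  "lastName" : "Pants",
  "birthDate" : { "$date" : "1986-07-14T00:00:00.000Z" },
  "accountNumber" : { "$numberLong" : "1234567890" }
}
```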
This is very verbose, but it works well. If we generate a contract with the consuming code on what the $types mean,
then we should be able to safely transmit our JSON files and have our types preserved. The layout of the data should be able to
change and we can use JSON libraries to ingest the files and load them into MongoDB and preserve our type information.
It will, however, still require some custom ingestion code.
Let's take this a step further
What if we could generate a BSON file that
could be loaded directly into MongoDB? BSON is a binary JSON specification and is how MongoDB natively stores its data. The
mongodump and
mongorestore utilities generate
and consume the BSON files respectively.
There are several very important things to keep in mind.
- The file structure is very important; it must be of the form: dump/database-name/collection-name.bson
- Index information is carried in a file of the form: dump/database-name/collection-name.metadata.json
The index file is interesting because we can not only transmit the type information along with the BSON file but we can
pass along the expected indexes as well. However, be very careful when adding indexes to existing collections!
Writing the BSON file itself is not very difficult; the Java driver includes a BSON encoder out of the box that we can use.
Here is an example that uses the
BasicBSONEncoder to write out a BSON file:
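A minimal sketch, assuming the legacy mongo-java-driver and a hypothetical BsonFileWriter class name:

```
import com.mongodb.DBObject;
import org.bson.BasicBSONEncoder;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Appends each encoded document to a .bson dump file.
public class BsonFileWriter {
    private final BasicBSONEncoder encoder = new BasicBSONEncoder();
    private final Path bsonFile;

    public BsonFileWriter(Path bsonFile) {
        this.bsonFile = bsonFile;
    }

    // BasicBSONEncoder.encode() turns the DBObject into its raw BSON bytes;
    // Files.write() appends them, so the file grows one document at a time.
    public void write(DBObject document) throws IOException {
        Files.write(bsonFile, encoder.encode(document),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```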
Here we are using Java 7 and the Files helper class to write the encoded MongoDB DBObject to our file one object at a time.
Let's consider a use case for this strategy: extracting rows from MySQL and loading them into MongoDB.
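Here is a sketch of that extraction, reusing the hypothetical BsonFileWriter from above; the table name and
connection handling are placeholders:

```
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

// Streams every row of a MySQL table into the BSON writer.
public class MySqlExtractor {
    public static void extract(Connection connection, String table,
                               BsonFileWriter writer) throws Exception {
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery("SELECT * FROM " + table)) {
            ResultSetMetaData meta = rows.getMetaData();
            while (rows.next()) {
                DBObject document = new BasicDBObject();
                // Column labels become keys; getObject() hands back Integer, Long,
                // Timestamp, etc., so the encoder preserves the original types.
                for (int column = 1; column <= meta.getColumnCount(); column++) {
                    document.put(meta.getColumnLabel(column), rows.getObject(column));
                }
                writer.write(document);
            }
        }
    }
}
```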
Here we are using the metadata carried by the MySQL result set to extract the keys, while the values are pulled as Object types. The
encoder will then maintain the types as extracted from the MySQL database.
The metadata file
The metadata file uses the following structure:
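Roughly the following (the database and collection names are placeholders):

```
{
  "options" : { },
  "indexes" : [
    { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "mydb.people" },
    { "v" : 1, "key" : { "lastName" : 1 }, "name" : "lastName_1", "ns" : "mydb.people" }
  ]
}
```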
We can see that this JSON document contains an array of indexes.
Let's take a look at how we can extract the index information from our MySQL table and carry it over to our MongoDB collection.
First we will use a wrapper around DBObject to map out the key value pairs in the correct format:
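A minimal sketch of such a wrapper (the class name is my own):

```
import com.mongodb.BasicDBObject;

// Shapes one index entry the way collection-name.metadata.json expects it.
public class IndexDocument extends BasicDBObject {
    public IndexDocument(String namespace, String indexName, BasicDBObject keys) {
        append("v", 1);            // index version
        append("key", keys);       // e.g. { "lastName" : 1 }
        append("name", indexName); // e.g. "lastName_1"
        append("ns", namespace);   // "database.collection"
    }
}
```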
Next we need to obtain the index information from the MySQL database:
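One way to do that is through the JDBC DatabaseMetaData.getIndexInfo() call; a sketch, again with hypothetical class names:

```
import com.mongodb.BasicDBObject;

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class MySqlIndexReader {
    // Pulls single-column index names and columns via JDBC metadata
    // (simplistic: compound indexes would need their columns grouped by index name).
    public static List<IndexDocument> readIndexes(Connection connection, String table,
                                                  String namespace) throws Exception {
        List<IndexDocument> indexes = new ArrayList<>();
        DatabaseMetaData meta = connection.getMetaData();
        try (ResultSet rs = meta.getIndexInfo(null, null, table, false, false)) {
            while (rs.next()) {
                String indexName = rs.getString("INDEX_NAME");
                String column = rs.getString("COLUMN_NAME");
                if (indexName == null || column == null) {
                    continue; // skip table statistics rows
                }
                indexes.add(new IndexDocument(namespace, indexName,
                        new BasicDBObject(column, 1)));
            }
        }
        return indexes;
    }
}
```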
This is a rather simplistic approach for pulling the index information from MySQL; more advanced
or compound indexes will require additional logic to handle.
Once we have our list of indexes we can pass it along to the meta data file writer:
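A sketch of that writer, using the legacy driver's JSON utility to serialize the document (the class name is hypothetical):

```
import com.mongodb.BasicDBObject;
import com.mongodb.util.JSON;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MetadataFileWriter {
    // Writes dump/<database>/<collection>.metadata.json alongside the .bson file.
    public static void write(Path metadataFile, List<IndexDocument> indexes) throws IOException {
        BasicDBObject metadata = new BasicDBObject("options", new BasicDBObject())
                .append("indexes", indexes);
        Files.write(metadataFile, JSON.serialize(metadata).getBytes(StandardCharsets.UTF_8));
    }
}
```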
Now we can create a main class to run all of our classes together and create our import file!
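Tying the hypothetical pieces together into one runnable sketch (the connection string, table, and names are placeholders):

```
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;

public class DumpBuilder {
    public static void main(String[] args) throws Exception {
        // mongorestore expects dump/<database>/<collection>.bson
        Path dumpDir = Paths.get("dump/mydb");
        Files.createDirectories(dumpDir);

        try (Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {

            // 1. Stream the rows into dump/mydb/people.bson
            BsonFileWriter bsonWriter = new BsonFileWriter(dumpDir.resolve("people.bson"));
            MySqlExtractor.extract(connection, "people", bsonWriter);

            // 2. Describe the indexes in dump/mydb/people.metadata.json
            List<IndexDocument> indexes =
                    MySqlIndexReader.readIndexes(connection, "people", "mydb.people");
            MetadataFileWriter.write(dumpDir.resolve("people.metadata.json"), indexes);
        }
        // The resulting directory can then be loaded with: mongorestore dump/
    }
}
```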
I hope you have enjoyed my ramblings on importing data into MongoDB. All source code for these examples
may be found here.