Sometimes you want completely free form documents and sometimes you don’t. In the 3.2 release of MongoDB the idea of
document validation was introduced. One area where I have encountered problems in the past is dates being inserted
using different data types.
Example (command line notation used for brevity): the first team was using numeric dates, the second team was using
string format, and the third team was using ISO dates.
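A minimal reconstruction of those three insert styles, using a collection and field name of my own:

```
// Team one: a numeric (epoch millisecond) value
db.dates.insert({ "date" : 1420070400000 })

// Team two: a string
db.dates.insert({ "date" : "2015-01-01" })

// Team three: an ISO date
db.dates.insert({ "date" : new Date("2015-01-01") })
```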
The communication issues that led to this problem are a story for another time, but needless to say a fair amount of
data cleanup was required to fix the issue, as the primary application that read the data was expecting an ISO date.
The code issue was solved by using a shared library that defined the database mapping object. Now let’s take a
look at how document validation could be used to solve this problem.
We will create a collection with the following command:
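A sketch of what that command might look like, assuming a collection named dates and the $type string aliases
introduced in 3.2:

```
db.createCollection("dates", {
  validator: { date: { $type: "date" } }
})
```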
With the collection created with this validator, one can now only insert documents whose date uses the new Date() format:
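For example (the value itself is arbitrary):

```
db.dates.insert({ "date" : new Date("2015-06-01") })
```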
The numeric and string inserts will give the following error:
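From the 3.2 shell the rejected inserts come back with something along these lines:

```
> db.dates.insert({ "date" : 1420070400000 })
WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 121,
    "errmsg" : "Document failed validation"
  }
})
```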
In my mind this error message leaves something to be desired. We do not know why we failed the validation, just
that the document failed.
There is another interesting side effect of the preceding validation: all documents in the dates collection are now
required to contain the date field. What if, however, not all of your documents contain a date field? We
can get around this problem with the following validator:
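One way to express that, again assuming the dates collection from above:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $exists: false } },
      { date: { $type: "date" } }
    ]
  }
})
```

The same validator document can also be attached to an existing collection with the collMod command.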
The date field is now optional, but if it exists it must be of type date.
Let's make it a bit more interesting and add the requirement that if the date exists it must be after
January 1st, 2015:
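A sketch of that validator, with the cut-off date written as a literal:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $exists: false } },
      { date: { $type: "date", $gt: new Date("2015-01-01") } }
    ]
  }
})
```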
By using the $or operator mixed with the implicit "$and" operator we can now logically divide the validation
requirements.
Some other interesting cases.
Integers vs floating-point values
Almost every developer new to MongoDB will make this "mistake" on the command line:
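Something like the following, using a throwaway numbers collection:

```
db.numbers.insert({ "value" : 2 })
// the shell treats every bare number as a 64-bit floating-point double
```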
One would think that they have inserted a document containing the integer number two into the collection, when
in fact the document now contains a
floating-point value.
I can attest to the pain this can cause (especially if you are using a strongly typed language such as Java).
We can use the validator to safeguard against this:
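For example:

```
db.createCollection("numbers", {
  validator: { value: { $type: "int" } }
})
```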
The preceding insert will now fail with the 121 error code; to store an actual integer you need to use the NumberInt constructor:
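For example:

```
db.numbers.insert({ "value" : NumberInt(2) })
```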
Oh noes, my collection can't haz documents!!!
At the time of writing it is quite possible to mess up your validation and make the collection non-writeable.
Consider the following example:
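A sketch of such a self-contradictory validator:

```
db.createCollection("dates", {
  validator: {
    $and: [
      { date: { $type: "int" } },
      { date: { $type: "date" } }
    ]
  }
})
```

No document can satisfy both conditions at once, so every insert is rejected.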
We have made the dates collection require a date field that is both an integer and ISO date type. So who validates
the validator? At this point it appears to be you at definition time. In future releases this may be something that
MongoDB Compass addresses.
Perhaps, however, all we really wanted to do was say that the date field can be either an integer or an ISO date.
In that case we can solve the problem with the following:
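For example:

```
db.createCollection("dates", {
  validator: {
    $or: [
      { date: { $type: "int" } },
      { date: { $type: "date" } }
    ]
  }
})
```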
Almost referential integrity?
MongoDB 3.2 also introduces some new aggregation operators. One of the more interesting ones is $lookup. This
allows you to run aggregations across multiple collections. However, you do not appear to be able to hack it
into a foreign key constraint for validation (yet).
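As a quick illustration of the operator itself (the collection and field names here are hypothetical), $lookup
performs a left outer join from one collection into another:

```
db.orders.aggregate([
  { $lookup: {
      from: "customers",          // collection to join against
      localField: "customerId",   // field in the orders documents
      foreignField: "_id",        // field in the customers documents
      as: "customer"              // array field added to each output document
  } }
])
```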
But what about performance?
I was expecting to see some slowdowns on loads when using validation, so I set up some load tests using
Apache JMeter. JMeter now comes with nice built-in MongoDB testing tools,
and I was all excited about posting pretty graphs showing how much validation slowed down the loading process.
However, there was next to no cost for using it! As such, I decided not to include these graphs but instead simply
recommend that you profile your own application as necessary.
So I have been pondering various ways to import data into MongoDB lately. While it is possible to import .json files or
.csv files directly, they do not carry the type data with them. Mongo's default behavior in this case appears to be to
store the data in the "least expensive" format. Thus fields that are intended to be longs may be stored as integers and
dates will be stored as strings, etc. If we are using a strongly typed language this can lead to issues when we retrieve
the data back out and it is not in a type that we expect.
So what are our alternatives?
Many legacy systems may send a .csv file with something like:
"Bob","Pants","07/14/1986"
We would then need a file descriptor of some sort to interpret the file, historically in XML:
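For instance, a hypothetical descriptor might look like this:

```
<fileDescriptor>
  <field position="1" name="firstName" type="string"/>
  <field position="2" name="lastName"  type="string"/>
  <field position="3" name="birthDate" type="date" format="MM/dd/yyyy"/>
</fileDescriptor>
```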
Or, perhaps we can try to use the file header to carry the information:
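For example, a made-up header convention that tags each column with a type:

```
firstName:string,lastName:string,birthDate:date
"Bob","Pants","07/14/1986"
```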
The problem with either approach is that we must update the header or the metadata file every time the incoming
data changes, and we still need custom code to ingest the file and load it into MongoDB.
Using JSON files that carry their type information with them.
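One way to do this is MongoDB's Extended JSON style of type markers; a document for our sample row might look
something like the following (the accountNumber field is invented here just to show a 64-bit long):

```
{
  "firstName" : "Bob",
  "lastName" : "Pants",
  "birthDate" : { "$date" : "1986-07-14T00:00:00.000Z" },
  "accountNumber" : { "$numberLong" : "1234567890" }
}
```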
This is very verbose, but it works well. If we generate a contract with the consuming code on what the $types mean,
then we should be able to safely transmit our JSON files and have our types preserved. The layout of the data should be able to
change and we can use JSON libraries to ingest the files and load them into MongoDB and preserve our type information.
It will, however, still require some custom ingestion code.
Let's take this a step further
What if we could generate a BSON file that
could be loaded directly into MongoDB? BSON is a binary JSON specification and is how MongoDB natively stores its data. The
mongodump and
mongorestore utilities generate
and consume the BSON files respectively.
There are several very important things to keep in mind.
- The file structure is very important; it must be of the form: dump/database-name/collection-name.bson
- Index information is carried in a file of the form: dump/database-name/collection-name.metadata.json
The index file is interesting because we can not only transmit the type information along with the BSON file but we can
pass along the expected indexes as well. However, be very careful when adding indexes to existing collections!
Writing the BSON file itself is not very difficult; the Java driver includes a BSON encoder out of the box that we can use.
Here is an example that uses the
BasicBSONEncoder to write out a BSON file:
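A minimal sketch, assuming the legacy mongo-java-driver and a hypothetical BsonFileWriter class name:

```
import com.mongodb.DBObject;
import org.bson.BasicBSONEncoder;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Appends each encoded document to a .bson dump file.
public class BsonFileWriter {
    private final BasicBSONEncoder encoder = new BasicBSONEncoder();
    private final Path bsonFile;

    public BsonFileWriter(Path bsonFile) {
        this.bsonFile = bsonFile;
    }

    // BasicBSONEncoder.encode() turns the DBObject into its raw BSON bytes;
    // Files.write() appends them, so the file grows one document at a time.
    public void write(DBObject document) throws IOException {
        Files.write(bsonFile, encoder.encode(document),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```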
Here we are using Java 7 and the Files helper class to write the encoded MongoDB DBObject to our file one object at a time.
Let's consider a use case for this strategy: extracting rows from MySQL and loading them into MongoDB.
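Here is a sketch of that extraction, reusing the hypothetical BsonFileWriter from above; the table name and
connection handling are placeholders:

```
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

// Streams every row of a MySQL table into the BSON writer.
public class MySqlExtractor {
    public static void extract(Connection connection, String table,
                               BsonFileWriter writer) throws Exception {
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery("SELECT * FROM " + table)) {
            ResultSetMetaData meta = rows.getMetaData();
            while (rows.next()) {
                DBObject document = new BasicDBObject();
                // Column labels become keys; getObject() hands back Integer, Long,
                // Timestamp, etc., so the encoder preserves the original types.
                for (int column = 1; column <= meta.getColumnCount(); column++) {
                    document.put(meta.getColumnLabel(column), rows.getObject(column));
                }
                writer.write(document);
            }
        }
    }
}
```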
Here we are using the metadata carried by the MySQL result set to extract the keys, while the values are pulled as Object types. The
encoder will then maintain the types as extracted from the MySQL database.
The metadata file
The metadata file uses the following structure:
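Roughly the following (the database and collection names are placeholders):

```
{
  "options" : { },
  "indexes" : [
    { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "mydb.people" },
    { "v" : 1, "key" : { "lastName" : 1 }, "name" : "lastName_1", "ns" : "mydb.people" }
  ]
}
```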
We can see that this JSON document contains an array of indexes.
Let's take a look at how we can extract the index information from our MySQL table and carry it over to our MongoDB collection.
First we will use a wrapper around DBObject to map out the key value pairs in the correct format:
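A minimal sketch of such a wrapper (the class name is my own):

```
import com.mongodb.BasicDBObject;

// Shapes one index entry the way collection-name.metadata.json expects it.
public class IndexDocument extends BasicDBObject {
    public IndexDocument(String namespace, String indexName, BasicDBObject keys) {
        append("v", 1);            // index version
        append("key", keys);       // e.g. { "lastName" : 1 }
        append("name", indexName); // e.g. "lastName_1"
        append("ns", namespace);   // "database.collection"
    }
}
```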
Next we need to obtain the index information from the MySQL database:
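One way to do that is through the JDBC DatabaseMetaData.getIndexInfo() call; a sketch, again with hypothetical class names:

```
import com.mongodb.BasicDBObject;

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class MySqlIndexReader {
    // Pulls single-column index names and columns via JDBC metadata
    // (simplistic: compound indexes would need their columns grouped by index name).
    public static List<IndexDocument> readIndexes(Connection connection, String table,
                                                  String namespace) throws Exception {
        List<IndexDocument> indexes = new ArrayList<>();
        DatabaseMetaData meta = connection.getMetaData();
        try (ResultSet rs = meta.getIndexInfo(null, null, table, false, false)) {
            while (rs.next()) {
                String indexName = rs.getString("INDEX_NAME");
                String column = rs.getString("COLUMN_NAME");
                if (indexName == null || column == null) {
                    continue; // skip table statistics rows
                }
                indexes.add(new IndexDocument(namespace, indexName,
                        new BasicDBObject(column, 1)));
            }
        }
        return indexes;
    }
}
```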
This is a rather simplistic approach for pulling the index information from MySQL; more advanced
or compound indexes will require additional logic to handle.
Once we have our list of indexes we can pass it along to the meta data file writer:
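A sketch of that writer, using the legacy driver's JSON utility to serialize the document (the class name is hypothetical):

```
import com.mongodb.BasicDBObject;
import com.mongodb.util.JSON;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MetadataFileWriter {
    // Writes dump/<database>/<collection>.metadata.json alongside the .bson file.
    public static void write(Path metadataFile, List<IndexDocument> indexes) throws IOException {
        BasicDBObject metadata = new BasicDBObject("options", new BasicDBObject())
                .append("indexes", indexes);
        Files.write(metadataFile, JSON.serialize(metadata).getBytes(StandardCharsets.UTF_8));
    }
}
```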
Now we can create a main class to run all of our classes together and create our import file!
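Tying the hypothetical pieces together into one runnable sketch (the connection string, table, and names are placeholders):

```
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;

public class DumpBuilder {
    public static void main(String[] args) throws Exception {
        // mongorestore expects dump/<database>/<collection>.bson
        Path dumpDir = Paths.get("dump/mydb");
        Files.createDirectories(dumpDir);

        try (Connection connection = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {

            // 1. Stream the rows into dump/mydb/people.bson
            BsonFileWriter bsonWriter = new BsonFileWriter(dumpDir.resolve("people.bson"));
            MySqlExtractor.extract(connection, "people", bsonWriter);

            // 2. Describe the indexes in dump/mydb/people.metadata.json
            List<IndexDocument> indexes =
                    MySqlIndexReader.readIndexes(connection, "people", "mydb.people");
            MetadataFileWriter.write(dumpDir.resolve("people.metadata.json"), indexes);
        }
        // The resulting directory can then be loaded with: mongorestore dump/
    }
}
```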
I hope you have enjoyed my ramblings on importing data into MongoDB. All source code for these examples
may be found here.