Straw in a Haystack

Example of generating BSON files with java

So I have been pondering various ways to import data into MongoDB lately. While it is possible to import .json files or .csv files directly, they do not carry the type data with them. Mongo's default behavior in this case appears to be to store the data in the "least expensive" format. Thus fields that are intended to be longs may be stored as integers and dates will be stored as strings, etc. If we are using a strongly typed language this can lead to issues when we retrieve the data back out and it is not in a type that we expect.

So what are our alternatives?

Many legacy systems may send something like a .csv file with something like:

“Bob”,”Pants”,”07/14/1986”

We would then need a file descriptor of some sort to interpret the file, historically in xml:

<?xml version="1.0" encoding="UTF-8"?>
<fields>
   <field-1>
      <field-name>f_name</field-name>
      <field-type>String</field-type>
   </field-1>
   <field-2>
      <field-name>l_name</field-name>
      <field-type>String</field-type>
   </field-2>
   <field-3>
      <field-name>birthdate</field-name>
      <field-type>Date</field-type>
      <date-format>MM/dd/yyyy</date-format>
   </field-3>
</fields>

Or, perhaps we can try to use the file header to carry the information:

if_name,String”,”l_name,String”,”birthdate,Date,MM/dd/yyyy”
“Bob”,”Pants”,”07/14/1986”

The problem with this approach is that we must update the header or the meta file every time there is a change in the incoming data and requires custom code to ingest the file and load it into MongoDB.

Using JSON files that carry their type information with them.

{
   "names": {
      "f_name":"Bob",
      "l_name":"Pants",
      "$type":"String"
   },
   "dates": {
      "bday":"07/14/1986",
      "$type":"date",
      "$date_format":"MM/dd/yyyy"
   }
}

This is very verbose, but it works well. If we generate a contract with the consuming code on what the $types mean then should be able to safely transmit our JSON files and have our types preserved. The layout of the data should be able to change and we can use JSON libraries to ingest the files and load them into MongoDB and preserve our type information. It will, however, still require some custom ingestion code

Lets take this a step further

what if we could generate a BSON file that could be loaded directly into MongoDB? BSON is a binary JSON specification and is how MongoDB natively stores its data. The mongodump and mongorestore utilities generate and consume the BSON files respectively.

There are several very important things to keep in mind.

The file structure is very important, it must be for the form: dump/database-name/collection-name.bson

Index information is carried in a file of the form: dump/database-name/collection-name.metadata.json </ul>

The index file is interesting because we can not only transmit the type information along with the BSON file but we can pass along the expected indexes as well. However, be very careful when adding indexes to existing collections!

Writing the bson file itself is not very difficult, the java driver includes a BSON encoder out of the box that we can use. Here is an example that uses the BasicBSONEncoder to write out a BSON file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import org.bson.BasicBSONEncoder;
import com.mongodb.DBObject;

public class BSONFileWriter {

   private final String path;
   private final BasicBSONEncoder encoder;

   public BSONFileWriter(String path) {
      this.path = path;
      this.encoder = new BasicBSONEncoder();
   }

   public void write(DBObject dbo) throws IOException {

      Files.write(Paths.get(path), encoder.encode(dbo),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);

   }

}

Here we are using java 7 and the Files helper class to write the encoded MongoDB DBObject to our file an object at a time.
Lets consider a use case for this strategy, extracting rows from MySQL and loading them into MongoDB.

...
public class MySqlDao implements Closeable {

   private final Connection conn;

   public MySqlDao(String connString) throws SQLException {
      conn = DriverManager.getConnection(connString);
   }

   public void exportMySqlToBSON(String query, String path)
         throws SQLException, IOException {
      BSONFileWriter bsonWriter = new BSONFileWriter(path);
      Statement st = null;
      try {
         Map<String, Object> mapper = new HashMap<String, Object>();
         st = conn.createStatement();
         ResultSet rs = st.executeQuery(query);
         // use the result set meta data to populate the keys for the hashmap
         // this will allow us to use the column names as the field keys in
         // MongoDB
         ResultSetMetaData metaData = rs.getMetaData();
         while (rs.next()) {
            for (int i = 1; i <= metaData.getColumnCount(); i++) {
               mapper.put(metaData.getColumnName(i), rs.getObject(i));
            }
            bsonWriter.write(new BasicDBObject(mapper));
         }
      } finally {
         if (st != null)
            st.close();
      }
   }
...
}

Here we are using the meta data carried by the MySQL result set to extract keys and the values are pulled as Object types. The encoder will then maintain the type as extracted from the MySQL database.

The meta data file

The meta data file has uses the following structure:

{ "indexes" : [ { "v" : 1, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.foo" }, { "v" : 1, "key" : { "abc" : 1 }, "name" : "abc_1", "ns" : "test.foo" } ] }

We can see that this JSON document is an array of indexes.

Lets take a look at how we can extract information from our MySQL table and carry that index over to our MongoDB collection.

First we will use a wrapper around DBObject to map out the key value pairs in the correct format:

mport com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

public class MetaData {
   private DBObject meta;

   public MetaData(int v, String key, int dir, String name, String ns) {
      meta = new BasicDBObject();
      meta.put("v", v);
      meta.put("key", new BasicDBObject(key, dir));
      meta.put("name", name);
      meta.put("ns", ns);

   }

   public DBObject getMetaData() {
      return meta;
   }

   public String toString() {
      return meta.toString();
   }
}

Next we need to obtain the index information from the MySQL database:

...
   public BasicDBList getIndexInfoForTable(String schema, String tableName)
         throws SQLException {
      BasicDBList rtn = new BasicDBList();
      Statement st = conn.createStatement();
      String query = "SHOW INDEX FROM %s";
      ResultSet rs = st.executeQuery(String.format(query, tableName));
      while (rs.next()) {
         MetaData md = new MetaData(1, rs.getString("COLUMN_NAME"), 1,
               rs.getString("COLUMN_NAME")+"_", schema + "." + tableName);
         rtn.add(md.getMetaData());
      }
      return rtn;
   }
...

This is a rather simplistic approach for pulling the index information from MySQL and more advanced or compound indexes will require additional logic to handle.

Once we have our list of indexes we can pass it along to the meta data file writer:

...
   public static synchronized void writeMetaDataFile(String DBName,
         String DBCollectionName, Indexizer idx) throws IOException {
      ensurePathExists(DBName, DBCollectionName);
      BufferedWriter bw = null;
      try {
         bw = new BufferedWriter(new FileWriter(String.format(
               Dumps.PATH_PATTERN, DBName, DBCollectionName, "metadata.json")));
         StringBuilder sb = new StringBuilder();
         sb.append("{ \"indexes\" : ");
         sb.append(idx.getMetaData(DBName, DBCollectionName));
         sb.append(" }");
         bw.write(sb.toString());
         bw.newLine();
      } finally {
         if (bw != null)
            bw.close();
      }
   }
...

Now we can create a main class to run all of our classes together and create our import file!

package org.simple.mysql;

import java.io.IOException;
import java.sql.SQLException;

import org.mongo.bson.Dumps;
import org.mongo.bson.MetaDataWriter;

public class MySqlDirectRunner {

   public static void main(String[] args) throws SQLException, IOException {
      Dumps.createDumpDirectories("test");
      MySqlDao dao = new MySqlDao("jdbc:mysql://localhost/test?");
      dao.exportMySqlToBSON("select * from foo", "dump/test/foo.bson");
      MetaDataWriter.writeMetaDataFile("test", "foo", new MySqlIndexizer(dao));
      dao.close();

   }
}

I hope you have enjoyed my ramblings on importing data into MongoDB. All source code found in these examples may be found here

Example of generating BSON files with java

So what are our alternatives?

Using JSON files that carry their type information with them.

Lets take this a step further

The meta data file

About Me