I have been tinkering with RxJava and reactive streams lately, and I decided to revisit one of my older projects, which tailed the MongoDB oplog and used RabbitMQ to broadcast changes made to a sharded MongoDB cluster. In this installment of tailing the oplog I decided not to use RabbitMQ and to try out Apache Flink instead. Flink appears to play in the same domain as Spark Streaming. I am not endorsing one over the other in this blog, but it is important to keep evaluating different technologies to ensure you make the right decisions. I am using Apache Flink to wire a streaming data source (the MongoDB oplog tail) to an Elasticsearch sink.
<p>All source code for this project may be found
<a href="https://github.com/JaiHirsch/flink-mingo-tail" target="new">here</a>. Also, before we get started: none of the code used here
is production quality. It was not test driven and was written as a thought experiment on a rainy
Friday afternoon as a way to learn more about
<a href="http://www.reactive-streams.org" target="new">reactive streams</a>,
<a href="https://github.com/ReactiveX/RxJava" target="new">RxJava</a>,
<a href="https://flink.apache.org" target="new">Apache Flink</a>, and
<a href="https://www.elastic.co" target="new">Elasticsearch</a>.
</p>
I will not go deep into the dirty details of oplog tailing other than to say it is much easier with the reactive driver. Let's, however, take a quick look at getting an event publisher using the reactive driver and at some of the syntax differences in the new generation of the MongoDB driver.
This snippet of code is inside a loop that is getting a publisher for every replica set inside of a sharded cluster.
It is also important to understand the filters that are generated through the getQueryFilter() call:
The timestamp filter keeps us from replaying the entire oplog every time we connect. Notice also that the query is run against MongoDB by host; each host comes from the server info reported by the sharded cluster.
The MongoDB reactive driver uses the Reactive Streams standard. Reactive Streams is rumored to be included in the Java 9 release. This posed a bit of an issue while I was trying to bind it to RxJava, but there is a great translation library that wires the two approaches together, RxJavaReactiveStreams. The bindPublisherToObservable method is called in a loop and attaches a subscriber to an oplog tail for each mongod that has been passed in. This creates one thread per mongod. The code then uses a ConcurrentHashMap to count the number of times a given operation is observed. Once the count in the map equals the replica set depth (the number of replicas per shard), the operation has fully replicated to every member of its replica set. At this point the operation is placed on an ArrayBlockingQueue, opsQueue, and removed from the map. The opsQueue is used by the Flink data source to collect the operation and place it on the Flink workflow.
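The counting step described above can be sketched with nothing but the JDK. This is a minimal illustration and not the code from the repository: the class name ReplicationTracker, the methods observe, pollOp and pendingCount, and the use of a String key and String payload are all assumptions made for the example. Only the ConcurrentHashMap, the ArrayBlockingQueue and the replica-depth rule come from the description.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: counts how many replica set members have reported an
// operation and queues it once every member has seen it.
class ReplicationTracker {
    private final int replicaDepth; // number of replicas per shard
    private final ConcurrentMap<String, Integer> seen = new ConcurrentHashMap<>();
    private final BlockingQueue<String> opsQueue;

    ReplicationTracker(int replicaDepth, int queueCapacity) {
        this.replicaDepth = replicaDepth;
        this.opsQueue = new ArrayBlockingQueue<>(queueCapacity);
    }

    // Called from each oplog-tailing thread. Returns true only for the call
    // that completes replication and places the operation on the queue.
    boolean observe(String opKey, String document) {
        // merge() is atomic per key, so exactly one caller sees the final count.
        int count = seen.merge(opKey, 1, Integer::sum);
        if (count < replicaDepth) {
            return false;
        }
        seen.remove(opKey);
        try {
            opsQueue.put(document); // blocks while the queue is full
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    // Non-blocking read for a consumer; returns null when nothing is queued.
    String pollOp() {
        return opsQueue.poll();
    }

    // Number of operations still waiting on at least one replica.
    int pendingCount() {
        return seen.size();
    }
}
```

With a replica depth of 3, the first two calls to observe for a given key return false and the third returns true, after which the document is available from pollOp. A bounded queue gives the tailing threads back-pressure when the consumer falls behind, which is presumably why an ArrayBlockingQueue was chosen over an unbounded one.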
The code to run the oplog tail to Elasticsearch replication is fairly simple:
First we create the stream execution environment, next we add the MongoDB oplog source, and finally we add the print and Elasticsearch sinks. The PrintSinkFunction was used to show how easy it is to attach multiple sinks to the execution environment.
For this example I have not fully fleshed out the Elasticsearch sink. It is actually indexing the oplog documents and not creating a synchronized copy of the MongoDB collections. I decided not to fully implement the Elasticsearch code as there are already a few decent projects out there that do this. Also, I was only able to get the 1.7x Apache Flink connector for Elasticsearch working, as I was unable to find the 2.x connector on Maven Central.
I had quite a bit of fun taking the reactive driver for MongoDB for a spin around the block. Stream- and event-driven programming is something we all need to be aware of, and it is just as important in the server-side and big data worlds as it is for Android and other front-end development.
Jai Hirsch
Senior Systems Architect
CARFAX
jai.hirsch@gmail.com