Feb 092012
 

I have three or four blog posts in the works, but nothing will happen on them until the end of the month. They’re all fairly extensive, and between trying to study for the bar exam and some personal projects done, I don’t have the time. But, I’m going to go ahead and take the time to do this post, and probably one more while the information is still fresh in my mind.

Background

As I just mentioned, I’ve been working on some personal projects. Mainly one big project with some small variations on that one. Basically, it’s just a pretty straightforward web application. Up until a couple months ago, I was using a MySQL backend. Then my buddy Travis starting blathering on about his latest infatuation with NoSQL and CouchDB/Membase. I decided to take a look, just because I had never really taken the time to learn anything about NoSQL. It seemed very useful, particularly for rapid prototyping, which is pretty close to the mode I have been in. Not having to deal with schemas and not having to write SQL queries is a godsend for that type of stuff.

I ended up comparing most of the big NoSQL DBs (like CouchDB and MongoDB), and for whatever reason I settled on CouchDB. There was probably a reason, and it would be useful for me to remember, but it was a couple of months ago so that won’t happen without reevaluating everything, and I don’t have that kind of time.

First, I paired CouchDB with PHPillow. PHPillow is, for a PHP project, one of the better designed projects I have come across. Now, that doesn’t make it great (or even good, really), but it really is pretty decent, comparatively. I quickly realized it was mostly useful for the interface with CouchDB (i.e., the querying aspect), and really didn’t work well with my program design (and probably wouldn’t work well with most properly designed applications). After reworking it back into my existing app, I finally started getting to use CouchDB. It was…glorious. Due to the nature of the project, the “schema” would (and will still) change frequently, so this sped things up greatly. Designing the views was simple and quick, mostly, and interfacing with the same. I can’t put a number on how much faster it is to develop with, but it is significant. But…

The Problems

There are two main problems I encountered that I’ll focus on:

  • Inability to “filter” or use wildcards to select results from a view
  • Pagination

But first…

A Disclaimer

Please note that, even though I’m pointing out the problems I had with CouchDB, that doesn’t mean that the problems are WITH CouchDB. It is just as likely, or perhaps MORE likely, that I didn’t do sufficient research and chose the wrong tool for the job. CouchDB is the way it is, and probably for particular reasons. In other words: CouchDB isn’t really made for the application I was using it for.

So please don’t take this as a knock on CouchDB. Even though I may use derogatory terms, it’s more due to the frustration that I caused myself, not because CouchDB sucks.

Wildcards

Let’s first set up an example:

Say you have a website that has products listed on it. The products have a few attributes, including a category, a brand, and a size. It’s a very, very basic feature that you might want to filter based on any combination of those attributes. Such as:

Find all products of ‘Brand C’ in size “Medium”. Or,

Find all products in ‘Category 1′, of any brand, and any size.

You can figure out the other possibilities. Well, with CouchDB, you create views into a product, which acts as an index. The important part of these views (or map) for this discussion are the keys. In this case, I might create a view that has an array for a key that looks like:

[product.category, product.brand, product.size]

…for each individual document in the database. So each document that gets saved would end up with an entry in that view that contains each of those elements in the key. If I want to find all products in ‘Category 1′, I can just send the server this:

['Category 1']

And it will spit back anything that matches ‘Category 1′ in the first array elements, and anything in the second and third elements. Now, what if I all products in ‘Category 1′ with ‘Brand A’? Easy:

['Category 1', 'Brand A']

That gets me what I want, in any size. So, what happens if I want to get all products in ‘Category 1′ that are size ‘Medium’? Should be something similar to:

['Category 1', *, 'Medium']

Right? No. Can’t. Be. Done. From my understanding, due to the fact this is all indexed using b-trees, it is incredibly “inefficient” to do that. Why exactly, I don’t know, since I’d have to go back and relearn b-trees and see how they’re implemented. But, needless to say, this is a huge limitation. There are ways around it, but they all really, REALLY suck. For example, you can create another view that has the size as the second element, like so:

[product.category, product.size, product.brand]

But you have to do this for almost any combination. And then you have to pick (in your code) what view you want to use based on which particular filters are being applied (which of course, could be any combination if you want your site to have that basic feature). There are other alternatives using multiple queries, the reduce function, or in-memory sorting, but none of those are acceptable for such a basic feature.

As a little more background, you can do something similar to wildcards, but the problem is, once CouchDB finds the matches in the column the “wildcard” is used, it stops on that column. In the wildcard example above, CouchDB would match all keys with ‘Category 1′ in the first column, anything in the second column, then…stop. Wouldn’t even look at the third column, so you get all values in that column.

How is this done in MongoDB?

Something like this:

db.products.find( ['category': 'Category 1','size':'Medium']);

That should get exactly what I want. Will it be as quick? Maybe not, I don’t know the differences in the underlying implementation yet. But regardless, it can actually be done, and very simply (without having to create a view/map in the first place).

Pagination

In the context of the example as above, pagination is a very basic feature. You might have a thousand products, and you don’t want to display the full list at once, so you want to paginate. Very basic. Should be easy to do, right? Not in CouchDB.

The basic pattern involves using a range query (which is what the pseudo-wildcard functionality is) and limit the results to NUM_PER_PAGE + 1. So if you want twenty results per page, you get twenty-one rows from the database. You then store the last column of the first row and the last column of the last row. The query itself would be like:

startkey=['Category 1', 'Brand A', null]&endkey=['Category 1', 'Brand A', {}]&limit=21

This effectively gets all items of ‘Brand A’ that are in ‘Category 1′, of any size, and limits it to 21 results. For comparisons, null is the smallest value, and objects are the largest, so this finds any numerical values or strings (among other things that aren’t relevant). Let’s say the first and last rows are these:

['Category 1', 'Brand A', 'Medium']
...
['Category 1', 'Brand A', 'Small']

One of the neat features of CouchDB is that it sorts the view results based on the columns, from left to right, which explains why Medium would be the first, and Small would come after it. We display 20 rows, and just save the info from the 21st. So, again, we would store the value ‘Small’, because this is the query to get the second page:

startkey=['Category 1', 'Brand A', 'Small']&endkey=['Category 1', 'Brand A', {}]&limit=21

Notice the change? Instead of starting the search with null, we start it with the value of the 21st row, and once again, limit it to the following 21 rows. The results will start with the last row of the previous query (the one we hid):

['Category 1', 'Brand A', 'Small']
...
['Category 1', 'Brand A', 'XXXLarge']

…and ends 21 rows forward. And this goes on until we get to the end. Not a huge deal, right? Well, we have to keep track of all these values, because to go to previous pages, we have to “undo” these queries. So this entire time, we have been building a list, which is now:

'Medium', 'Small', 'XXXLarge'

You can think of it as a stack. Going to previous pages requires us to pop items off the top of the stack. Oh, and one other important thing: You can’t figure out the number of pages by doing this. Okay, in reality it’s not THAT important, but it’s useful and a nice feature to have.

Now, there IS a “skip” option, which would allow you to do something similar to a normal SQL limit query. But according to the CouchDB folks, this is really, really slow, and should not be done. So, there you have it.

How is this done in MongoDB?

A lot like you would do it in SQL, actually. In SQL, you just throw a limit clause at the end of a query, which would look something like:

SELECT * FROM products LIMIT 40,20

This would get 20 rows, starting from the 40th. Simple. MongoDB is just as easy…but it requires a small amount of background.

When you use MongoDB, you make a call to find(), which returns a cursor. This may be a familiar concept to those who have used SQL before. Essentially, it points to a spot in the list of results, and allows you to traverse over them (and sometimes back). With MongoDB, the query doesn’t actually get run until you try to access the results. So, no query is run with the following code:

db.products.find( ['category': 'Category 1','size':'Medium']);

At least not until you actually try to access it. That means you can do this:

$cursor = db.products.find( ['category': 'Category 1','size':'Medium']);
$cursor->skip( 40 );
$cursor->limit( 20 );

That’s it. Basically an object oriented method of what one would do with SQL. Now all you have to remember is what page you’re on, just like most people would be familiar with. And with the PHP driver/interface, the calls to skip() and limit() return the actual cursor itself, so you can chain them:

$cursor = db.products.find( ['category': 'Category 1','size':'Medium'])->skip( 40 )->limit( 20 );

Done. Again, might it suffer from the same problems that cause the CouchDB implementation to be inefficient? Possibly, but if so, they sure as hell don’t mention it on any of the beginner documents. It may be there and I just haven’t found it though.

Miscellaneous

A few other odds and ends.

As I mentioned above, there is a PHP driver for MongoDB. It is really, really clean and well put together. Seems to be at least PDO quality. Because of this integration, it is even easier to get working. In fact, I had the core CouchDB stuff replaced in a couple of hours, and only expect it to take another couple of hours to completely replace all CouchDB references in my application. MongoDB supports ad hoc queries, as well as indexes and maps, like CouchDB, which will further speed up development.

One thing that really annoyed me about CouchDB is that it is a straight document store. You can’t group related documents together. you can accomplish the same thing with views, but it seemed a bit…dirty. With MongoDB, you have “collections”, which are basically the same thing as SQL tables. There are a bunch of other features that I haven’t gotten into yet that will make things much more enjoyable than CouchDB.

Conclusion

CouchDB probably has its place. Rather, I KNOW it has its place. But for most web applications, MongoDB seems to be the better choice. Even if CouchDB is faster, it won’t matter for most web applications. Even if MongoDB is a little bit slower than a SQL solution, it still cuts development time by A LOT. At least for prototyping, it would be well worth it, even if you have to eventually switch back to SQL. (I say that now without having actually done it, so take that with a grain of salt, but that’s my intuition.)

I don’t know why CouchDB doesn’t support many of the same features; it could be because they ARE more inefficient, and CouchDB is all about performance…or it could simply be that they don’t have time or the man-power to add them. I don’t know, so I can’t really fault CouchDB. I’m sure for an application that requires the best performance possible, CouchDB is perfect. But for a generic web application (99.9% of the applications), MongoDB seems to be a better choice. There are obviously more evaluation that needs to go into the decision for a live, customer-facing application, particularly involving things such as replication, so you shouldn’t rely solely on this review to pick one or the other. I certainly have not ruled out going back to SQL once the app is developed, but I will evaluate that when it is necessary. For my current status, it looks like MongoDB is going to be much better.

I will write another blog in the next couple of days that gives some details on how my application is set up, and what work was involved in switching over to MongoDB.

I’d also like to provide a lot of thanks to Jamie Talbot for this blog, which finally made me realize that I couldn’t do what I wanted. Nothing else I came across was very helpful. But no thanks to him for getting me digging through the actual Erlang to try to figure out why it wouldn’t work and/or how I could make it work (to no avail).

Be Sociable, Share!

  3 Responses to “CouchDB vs. MongoDB – A Practical Experience”

  1. Thanks for the post. I was really exited about all features couchdb has but unfotunately cannot use it mainly because of the wildcards… I still find the concept of a master-master http database with apps beautiful. Maybe in future there will be a fork that won’t use btree and thus allow more flexibility on this.
    Regards

  2. Hi
    a tip for anyone using couchdb and need powerful search, pagination and other things is to use it with elasticsearch and their couchdb river.
    br, and thanks for the post!

  3. Thanks mate, really a great guide. Right now i’m learning to code into mongodb, i’ll be using node.js and this really was useful. Do you have any benchmarks related to mongodb vs couchdb ?

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>