Skip to main content

Posts tagged with 'aggregation'

This is a repost that originally appeared on the Couchbase Blog: New Querying Features in Couchbase Server 5.5.

New querying features figure prominently in the latest release of Couchbase Server 5.5. Check out the announcement and download the developer build for free right now.

In this post, I want to highlight a few of the new features and show you how to get started using them:

  • ANSI JOINs - N1QL in Couchbase already has JOIN, but now JOIN is more standards compliant and more flexible.

  • HASH joins - Performance on certain types of joins can be improved with a HASH join (in Enterprise Edition only)

  • Aggregate pushdowns - GROUP BY can be pushed down to the indexer, improving aggregation performance (in Enterprise Edition only)

All the examples in this post use the "travel-sample" bucket that comes with Couchbase.

ANSI JOINs

Until Couchbase Server 5.5, JOINs were possible, with two caveats:

  1. One side of the JOIN has to be document key(s)

  2. You must use the ON KEYS syntax

In Couchbase Server 5.5, it is no longer necessary to use ON KEYS, and so writing joins becomes much more natural and more in line with other SQL dialects.

Previous JOIN syntax

For example, here’s the old syntax:

SELECT r.destinationairport, r.sourceairport, r.distance, r.airlineid, a.name
FROM `travel-sample` r
JOIN `travel-sample` a ON KEYS r.airlineid
WHERE r.type = 'route'
AND r.sourceairport = 'CMH'
ORDER BY r.distance DESC
LIMIT 10;

This will get 10 routes that start at CMH airport, joined with their corresponding airline documents. The result are below (I’m showing them in table view, but it’s still JSON):

ANSI join results

New JOIN syntax

And here’s the new syntax doing the same thing:

SELECT r.destinationairport, r.sourceairport, r.distance, r.airlineid, a.name
FROM `travel-sample` r
JOIN `travel-sample` a ON META(a).id = r.airlineid
WHERE r.type = 'route'
AND r.sourceairport = 'CMH'
ORDER BY r.distance DESC
LIMIT 10;

The only difference is the ON. Instead of ON KEYS, it’s now ON <field1> = <field2>. It’s more natural for those coming from a relational background (like myself).

But that’s not all. Now you are no longer limited to joining just on document keys. Here’s an example of a JOIN on a city field.

SELECT a.airportname, a.city AS airportCity, h.name AS hotelName, h.city AS hotelCity, h.address AS hotelAddress
FROM `travel-sample` a
INNER JOIN `travel-sample` h ON h.city = a.city
WHERE a.type = 'airport'
AND h.type = 'hotel'
LIMIT 10;

This query will show hotels that match airports based on their city.

ANSI join on fields

Note that for this to work, you must have an index created on the field that’s on the inner side of the JOIN. The "travel-sample" bucket already contains a predefined index on the city field. If I were to attempt it with other fields, I’d get an error message like "No index available for ANSI join term…​".

For more information on ANSI JOIN, check out the full N1QL JOIN documentation.

Note: The old JOIN, ON KEYS syntax will still work, so don’t worry about having to update your old code.

Hash Joins

Under the covers, there are different ways that joins can be carried out. If you run the query above, Couchbase will use a Nested Loop (NL) approach to execute the join. However, you can also instruct Couchbase to use a hash join instead. A hash join can sometimes be more performant than a nested loop. Additionally, a hash join isn’t dependent on an index. It is, however, dependent on the join being an equality join only.

For instance, in "travel-sample", I could join landmarks to hotels on their email fields. This may not be the best find to find out if a hotel is a landmark, but since email is not indexed by default, it illustrates the point.

SELECT l.name AS landmarkName, h.name AS hotelName, l.email AS landmarkEmail, h.email AS hotelEmail
FROM `travel-sample` l
INNER JOIN `travel-sample` h ON h.email = l.email
WHERE l.type = 'landmark'
AND h.type = 'hotel';

The above query will take a very long time to run, and probably time out.

Syntax

Next I’ll try a hash join, which must be explicitly invoked with a USE HASH hint.

SELECT l.name AS landmarkName, h.name AS hotelName, l.email AS landmarkEmail, h.email AS hotelEmail
FROM `travel-sample` l
INNER JOIN `travel-sample` h USE HASH(BUILD) ON h.email = l.email
WHERE l.type = 'landmark'
AND h.type = 'hotel';

A hash join has two sides: a BUILD and a PROBE. The BUILD side of the join will be used to create an in-memory hash table. The PROBE side will use that table to find matches and perform the join. Typically, this means you want the BUILD side to be used on the smaller of the two sets. However, you can only supply one hash hint, and only to the right side of the join. So if you specify BUILD on the right side, then you are implicitly using PROBE on the left side (and vice versa).

BUILD and PROBE

So why did I use HASH(BUILD)?

I know from using INFER and/or Bucket Insights that landmarks make up roughly 10% of the data, and hotels make up about 3%. Also, I know from just trying it out that HASH(BUILD) was slightly slower. But in either case, the query execution time was milliseconds. Turns out there are two hotel-landmark pairs with the same email address.

Hash join results

USE HASH will tell Couchbase to attempt a hash join. If it cannot do so (or if you are using Couchbase Server Community Edition), it will fall back to a nested-loop. (By the way, you can explicitly specify nested-loop with the USE NL syntax, but currently there is no reason to do so).

For more information, check out the HASH join areas of the documentation.

Aggregate pushdowns

Aggregations in the past have been tricky when it comes to performance. With Couchbase Server 5.5, aggregate pushdowns are now supported for SUM, COUNT, MIN, MAX, and AVG.

In earlier versions of Couchbase, indexing was not used for statements involving GROUP BY. This could severely impact performance, because there is an extra "grouping" step that has to take place. In Couchbase Server 5.5, the index service can do the grouping/aggregation.

Example

Here’s an example query that finds the total number of hotels, and groups them by country, state, and city.

SELECT country, state, city, COUNT(1) AS total
FROM `travel-sample`
WHERE type = 'hotel' and country is not null
GROUP BY country, state, city
ORDER BY COUNT(1) DESC;

The query will execute, and it will return as a result:

Aggregation result

Let’s take a look at the visual query plan (only available in Enterprise Edition, but you can view the raw Plan Text in Community Edition).

Query plan with no pushdown

Note that the only index being used is for the type field. The grouping step is doing the aggregation work. With the relatively small travel-sample data set, this query is taking around ~90ms on my single node desktop. But let’s see what happens if I add an index on the fields that I’m grouping by:

Indexing

CREATE INDEX ix_hotelregions ON `travel-sample` (country, state, city) WHERE type='hotel';

Now, execute the above SELECT query again. It should return the same results. But:

  • It’s now taking ~7ms on my single node desktop. We’re taking ms, but with a large, more realistic data set, that is a huge difference in magnitude.

  • The query plan is different.

Query plan with pushdown

Note that this time, there is no 'group' step. All the work is being pushed down to the index service, which can use the ix_hotelregions index. It can use this index because my query is exactly matching the fields in the index.

Index push down does not always happen: your query has to meet specific conditions. For more information, check out the GROUP BY and Aggregate Performance areas of the documentation.

Summary

With Couchbase Server 5.5, N1QL includes even more standards-compliant syntax and becomes more performant than ever.

Have a question for me? I’m on Twitter @mgroves. You can also check out @N1QL on Twitter. The N1QL Forum is a good place to go if you have in-depth questions about N1QL.

This is a repost that originally appeared on the Couchbase Blog: Aggregate grouping with N1QL or with MapReduce.

Aggregate grouping is what I’m titling this blog post, but I don’t know if it’s the best name. Have you ever used MySQL’s GROUP_CONCAT function or the FOR XML PATH('') workaround in SQL Server? That’s basically what I’m writing about today. With Couchbase Server, the easiest way to do it is with N1QL’s ARRAY_AGG function, but you can also do it with an old school MapReduce View.

I’m writing this post because one of our solution engineers was working on this problem for a customer (who will go unnamed). Neither of us could find a blog post like this with the answer, so after we worked together to come up with a solution, I decided I would blog about it for my future self (which is pretty much the main reason I blog anything, really. The other reason is to find out if anyone else knows a better way).

Before we get started, I’ve made some material available if you want to follow along. The source code I used to generate the "patient" data used in this post is available on GitHub. If you aren’t .NET savvy, you can just use cbimport on sample data that I’ve created. (Or, you can use the N1QL sandbox, more information on that later). The rest of this blog post assumes you have a "patients" bucket with that sample data in it.

Requirements

I have a bucket of patient documents. Each patient has a single doctor. The patient document refers to a doctor by a field called doctorId. There may be other data in the patient document, but we’re mainly focused on the patient document’s key and the doctorId value. Some examples:

key 01257721
{
    "doctorId": 58,
    "patientName": "Robyn Kirby",
    "patientDob": "1986-05-16T19:01:52.4075881-04:00"
}

key 116wmq8i
{
    "doctorId": 8,
    "patientName": "Helen Clark",
    "patientDob": "2016-02-01T04:54:30.3505879-05:00"
}

Next, we can assume that each doctor can have multiple patients. We can also assume that a doctor document exists, but we don’t actually need that for this tutorial, so let’s just focus on the patients for now.

Finally, what we want for our application (or report or whatever), is an aggregate grouping of the patients with their doctor. Each record would identify a doctor and a list/array/collection of patients. Something like:

doctorpatients

58

01257721, 450mkkri, 8g2mrze2 …​

8

05woknfk, 116wmq8i, 2t5yttqi …​

…​ etc …​

…​ etc …​

This might be useful for a dashboard showing all the patients assigned to doctors, for instance. How can we get the data in this form, with N1QL or with MapReduce?

N1QL Aggregate grouping

N1QL gives us the ARRAY_AGG function to make this possible.

Start by selecting the doctorId from each patient document, and the key to the patient document. Then, apply ARRAY_AGG to the patient document ID. Finally, group the results by the doctorId.

SELECT p.doctorId AS doctor, ARRAY_AGG(META(p).id) AS patients
FROM patients p
GROUP BY p.doctorId;

Note: don’t forget to run CREATE PRIMARY INDEX ON patients for this tutorial to enable a primary index scan.

Imagine this query without the ARRAY_AGG. It would return one record for each patient. By adding the ARRAY_AGG and the GROUP BY, it now returns one record for each doctor.

Here’s a snippet of the results on the sample data set I created:

Aggregate grouping results in N1QL

If you don’t want to go through the trouble of creating a bucket and importing sample data, you can also try this in the N1QL tutorial sandbox. There aren’t patient documents in there, so the query will be a little different.

I’m going to group up emails by age. Start by selecting the age from each document, and the email from each document. Then, apply ARRAY_AGG to the email. Finally, group the results by the age.

SELECT t.age AS age, ARRAY_AGG(t.email) AS emails
FROM tutorial t
group by t.age;

Here’s a screenshot of some of the results from the sandbox:

N1QL sandbox results

Aggregate group with MapReduce

Similar aggregate grouping can also be achieved with a MapReduce View.

Start by creating a new View. From Couchbase Console, go to Indexes, then Views. Select the "patients" bucket. Click "Create Development View". Name a design document (I called mine "_design/dev_patient". Create a view, I called mine "doctorPatientGroup".

We’ll need both a Map and a custom Reduce function.

First, for the map, we just want the doctorId (in an array, since we’ll be using grouping) and the patient’s document ID.

function (doc, meta) {
    emit([doc.doctorId], meta.id);
}

Next, for the reduce function, we’ll take the values and concatenate them into an array. Below is one way that you can do it. I do not claim to be a JavaScript expert or a MapReduce expert, so there might be a more efficient way to tackle this:

function reduce(key, values, rereduce) {
    var merged = [].concat.apply([], values);
    return merged;
}

After you’ve created both map and reduce functions, save the index.

Finally, when actually calling this Index, set group_level to 1. You can do this in the UI:

Aggregate grouping with MapReduce

Or you can do it from the Index URL. Here’s an example from a cluster running on my local machine:

http://127.0.0.1:8092/patients/_design/dev_patients/_view/doctorPatientGroup?connection_timeout=60000&full_set=true&group_level=1&inclusive_end=true&skip=0&stale=false

The result of that view should look like this (truncated to look nicer in a blog post):

{"rows":[
{"key":[0],"value":["reo8th6f","g53x9e8d", ... ]},
{"key":[1],"value":["k4xkhmki","g1jtc0oj", ... ]},
{"key":[2],"value":["spp6gf3k","3z93wyan"]},
{"key":[3],"value":["qnx93fh3","gssusiun", ...]},
{"key":[4],"value":["qvqgb0ve","jm0g69zz", ...]},
{"key":[5],"value":["ywjfvad6","so4uznxx", ...]}
...
]}

Summary

I think the N1QL method is easier, but there may be performance benefits to using MapReduce in some cases. In either case, you can accomplish aggregate grouping just as easily (if not more easily) as in a relational database.

Interested in learning more about N1QL? Be sure to check out the complete N1QL tutorial/sandbox. Interested in MapReduce Views? Check out the MapReduce Views documentation to get started.

Did you find this post useful? Have suggestions for improvement? Please leave a comment below, or contact me on Twitter @mgroves.

Matthew D. Groves

About the Author

Matthew D. Groves lives in Central Ohio. He works remotely, loves to code, and is a Microsoft MVP.

Latest Comments

Twitter