Skip to main content

Posts tagged with 'Couchbase Server'

This is a repost that originally appeared on the Couchbase Blog: New Querying Features in Couchbase Server 5.5.

New querying features figure prominently in the latest release of Couchbase Server 5.5. Check out the announcement and download the developer build for free right now.

In this post, I want to highlight a few of the new features and show you how to get started using them:

  • ANSI JOINs - N1QL in Couchbase already has JOIN, but now JOIN is more standards compliant and more flexible.

  • HASH joins - Performance on certain types of joins can be improved with a HASH join (in Enterprise Edition only)

  • Aggregate pushdowns - GROUP BY can be pushed down to the indexer, improving aggregation performance (in Enterprise Edition only)

All the examples in this post use the "travel-sample" bucket that comes with Couchbase.

ANSI JOINs

Until Couchbase Server 5.5, JOINs were possible, with two caveats:

  1. One side of the JOIN has to be document key(s)

  2. You must use the ON KEYS syntax

In Couchbase Server 5.5, it is no longer necessary to use ON KEYS, and so writing joins becomes much more natural and more in line with other SQL dialects.

Previous JOIN syntax

For example, here’s the old syntax:

SELECT r.destinationairport, r.sourceairport, r.distance, r.airlineid, a.name
FROM `travel-sample` r
JOIN `travel-sample` a ON KEYS r.airlineid
WHERE r.type = 'route'
AND r.sourceairport = 'CMH'
ORDER BY r.distance DESC
LIMIT 10;

This will get 10 routes that start at CMH airport, joined with their corresponding airline documents. The result are below (I’m showing them in table view, but it’s still JSON):

ANSI join results

New JOIN syntax

And here’s the new syntax doing the same thing:

SELECT r.destinationairport, r.sourceairport, r.distance, r.airlineid, a.name
FROM `travel-sample` r
JOIN `travel-sample` a ON META(a).id = r.airlineid
WHERE r.type = 'route'
AND r.sourceairport = 'CMH'
ORDER BY r.distance DESC
LIMIT 10;

The only difference is the ON. Instead of ON KEYS, it’s now ON <field1> = <field2>. It’s more natural for those coming from a relational background (like myself).

But that’s not all. Now you are no longer limited to joining just on document keys. Here’s an example of a JOIN on a city field.

SELECT a.airportname, a.city AS airportCity, h.name AS hotelName, h.city AS hotelCity, h.address AS hotelAddress
FROM `travel-sample` a
INNER JOIN `travel-sample` h ON h.city = a.city
WHERE a.type = 'airport'
AND h.type = 'hotel'
LIMIT 10;

This query will show hotels that match airports based on their city.

ANSI join on fields

Note that for this to work, you must have an index created on the field that’s on the inner side of the JOIN. The "travel-sample" bucket already contains a predefined index on the city field. If I were to attempt it with other fields, I’d get an error message like "No index available for ANSI join term…​".

For more information on ANSI JOIN, check out the full N1QL JOIN documentation.

Note: The old JOIN, ON KEYS syntax will still work, so don’t worry about having to update your old code.

Hash Joins

Under the covers, there are different ways that joins can be carried out. If you run the query above, Couchbase will use a Nested Loop (NL) approach to execute the join. However, you can also instruct Couchbase to use a hash join instead. A hash join can sometimes be more performant than a nested loop. Additionally, a hash join isn’t dependent on an index. It is, however, dependent on the join being an equality join only.

For instance, in "travel-sample", I could join landmarks to hotels on their email fields. This may not be the best find to find out if a hotel is a landmark, but since email is not indexed by default, it illustrates the point.

SELECT l.name AS landmarkName, h.name AS hotelName, l.email AS landmarkEmail, h.email AS hotelEmail
FROM `travel-sample` l
INNER JOIN `travel-sample` h ON h.email = l.email
WHERE l.type = 'landmark'
AND h.type = 'hotel';

The above query will take a very long time to run, and probably time out.

Syntax

Next I’ll try a hash join, which must be explicitly invoked with a USE HASH hint.

SELECT l.name AS landmarkName, h.name AS hotelName, l.email AS landmarkEmail, h.email AS hotelEmail
FROM `travel-sample` l
INNER JOIN `travel-sample` h USE HASH(BUILD) ON h.email = l.email
WHERE l.type = 'landmark'
AND h.type = 'hotel';

A hash join has two sides: a BUILD and a PROBE. The BUILD side of the join will be used to create an in-memory hash table. The PROBE side will use that table to find matches and perform the join. Typically, this means you want the BUILD side to be used on the smaller of the two sets. However, you can only supply one hash hint, and only to the right side of the join. So if you specify BUILD on the right side, then you are implicitly using PROBE on the left side (and vice versa).

BUILD and PROBE

So why did I use HASH(BUILD)?

I know from using INFER and/or Bucket Insights that landmarks make up roughly 10% of the data, and hotels make up about 3%. Also, I know from just trying it out that HASH(BUILD) was slightly slower. But in either case, the query execution time was milliseconds. Turns out there are two hotel-landmark pairs with the same email address.

Hash join results

USE HASH will tell Couchbase to attempt a hash join. If it cannot do so (or if you are using Couchbase Server Community Edition), it will fall back to a nested-loop. (By the way, you can explicitly specify nested-loop with the USE NL syntax, but currently there is no reason to do so).

For more information, check out the HASH join areas of the documentation.

Aggregate pushdowns

Aggregations in the past have been tricky when it comes to performance. With Couchbase Server 5.5, aggregate pushdowns are now supported for SUM, COUNT, MIN, MAX, and AVG.

In earlier versions of Couchbase, indexing was not used for statements involving GROUP BY. This could severely impact performance, because there is an extra "grouping" step that has to take place. In Couchbase Server 5.5, the index service can do the grouping/aggregation.

Example

Here’s an example query that finds the total number of hotels, and groups them by country, state, and city.

SELECT country, state, city, COUNT(1) AS total
FROM `travel-sample`
WHERE type = 'hotel' and country is not null
GROUP BY country, state, city
ORDER BY COUNT(1) DESC;

The query will execute, and it will return as a result:

Aggregation result

Let’s take a look at the visual query plan (only available in Enterprise Edition, but you can view the raw Plan Text in Community Edition).

Query plan with no pushdown

Note that the only index being used is for the type field. The grouping step is doing the aggregation work. With the relatively small travel-sample data set, this query is taking around ~90ms on my single node desktop. But let’s see what happens if I add an index on the fields that I’m grouping by:

Indexing

CREATE INDEX ix_hotelregions ON `travel-sample` (country, state, city) WHERE type='hotel';

Now, execute the above SELECT query again. It should return the same results. But:

  • It’s now taking ~7ms on my single node desktop. We’re taking ms, but with a large, more realistic data set, that is a huge difference in magnitude.

  • The query plan is different.

Query plan with pushdown

Note that this time, there is no 'group' step. All the work is being pushed down to the index service, which can use the ix_hotelregions index. It can use this index because my query is exactly matching the fields in the index.

Index push down does not always happen: your query has to meet specific conditions. For more information, check out the GROUP BY and Aggregate Performance areas of the documentation.

Summary

With Couchbase Server 5.5, N1QL includes even more standards-compliant syntax and becomes more performant than ever.

Have a question for me? I’m on Twitter @mgroves. You can also check out @N1QL on Twitter. The N1QL Forum is a good place to go if you have in-depth questions about N1QL.

This is a repost that originally appeared on the Couchbase Blog: Alexa Skills with Azure Functions and Couchbase.

Alexa Skills are the "apps" that you can build to run on Amazon devices like the Echo, Echo Dot, etc. In this blog post, you’ll learn how to build an Alexa skill using serverless Azure Functions and a Couchbase backend running on Azure. This post builds on a lot of blog posts I’ve written about Azure Functions, Serverless, and Couchbase on Azure in the past:

What kind of Alexa skills am I building?

I work as a Developer Advocate, which means I often spend time at sponsor booths at developer events. I love doing this: I get to tell people how great Couchbase is, and I often get feedback from developers about what problems they’re trying to solve with Couchbase.

If there’s one thing I don’t like about working a booth, though, it’s repetition. I often get asked the same set of questions hundreds of times per events:

  • What is Couchbase? (distributed NoSQL document database with a memory first architecture)

  • How is Couchbase different than MongoDB? (they are both document databases, but Couchbase has major feature and architectural differences)

  • Is Couchbase the same thing as CouchDB? (No.)

I’m not complaining, mind you. It’s just that it’s hard to be enthusiastic when answering the question for the 100th time as the conference is about to close down.

But you know who is always enthusiastic? Alexa! So, if I bring my Echo Dot to the next event, maybe she can help me:

  • What is Couchbase? - Alexa will say a random interesting fact about Couchbase

  • How is Couchbase different than MongoDB? Alexa will say a random architectural or feature difference.

  • Is Couchbase the same thing as CouchDB? Alexa will say "no".

If these Alexa skills turn out to be helpful, I can expand the skills later to answer more complex questions.

If you want to follow along with this post and create your own Alexa skills, the full source code is available on Github.

Design

Alexa skills are registered with Amazon. For the most part, they make simple HTTP requests to the endpoint that you designate and expect a certain JSON response. Azure Functions can process HTTP requests. The Azure Functions can make queries out to a database full of responses, and can also keep track of how many times each response has been given.

Below is a high-level architectural diagram of my minimum viable Alexa skills project:

Architecture diagram from you

Data storage and design

The skill is going to ultimately query some data from Couchbase Server. I’ll start with 2 different kinds of documents. (If these Alexa skills turn out to be useful, I’ll add more complex stuff later).

Document design

Each document represents a possible response. Each will have 3 fields:

  • type - This will be either "mongodbcomparison" or "whatiscouchbase".

  • number - The number of times this response has been used (starting at 0).

  • text - The text that I want the Alexa skills to say.

The document key design of these documents is not important (at least not yet), since I’ll be using only N1QL (SQL for JSON) queries to retrieve them. However, I’ve decided to create keys like "mongo::2" and "couchbase::5".

To start, I will store this data in a single Couchbase node on a low cost Azure VM. A single node with a small amount of data should be able to handle even heavy booth traffic no problem. But if, for instance, I were to install these as kiosks in airports around the world I will definitely need to scale up my Couchbase cluster. Couchbase and Azure makes this easy.

Query design

To get a random document, I need to run a N1QL query:

SELECT m.*, META(m).id
FROM boothduty m
WHERE m.type = 'mongodbcomparison'
ORDER BY UUID()
LIMIT 1;

UUID is functioning as a random number generator. That’s not really what it’s for, but it’s "good enough". If I really needed true randomness, I could make a curl request in N1QL to random.org’s API.

To run that query, I need to create an index for the 'type' field:

CREATE INDEX ix_type ON boothduty(type);

Azure Functions

To create an Azure Function, I used an existing .NET library called AlexaSkills.NET, which makes it very easy to write the code you need to create Alexa skills.

After creating my Azure Functions solution, I added it with NuGet.

Using AlexaSkills.NET

Next, I created a "speechlet" class. I chose to make my speechlet asynchronous, but a synchronous option exists as well. There are four methods that need to be created. I only really need two of them for the skill at this point.

    public class BoothDutySpeechlet : SpeechletBase, ISpeechletWithContextAsync
    {
        public async Task<SpeechletResponse> OnIntentAsync(IntentRequest intentRequest, Session session, Context context)
        {
            try
            {
                var intentName = intentRequest.Intent.Name;
                var intentProcessor = IntentProcessor.Create(intentName);
                return await intentProcessor.Execute(intentRequest);
            }
            catch (Exception ex)
            {
                var resp = new SpeechletResponse();
                resp.ShouldEndSession = false;
                resp.OutputSpeech = new PlainTextOutputSpeech() { Text = ex.Message };
                return await Task.FromResult(resp);
            }
        }

        public Task<SpeechletResponse> OnLaunchAsync(LaunchRequest launchRequest, Session session, Context context)
        {
            var resp = new SpeechletResponse();
            resp.ShouldEndSession = false;
            resp.OutputSpeech = new PlainTextOutputSpeech() { Text = "Welcome to the Couchbase booth. Ask me about Couchbase." };
            return Task.FromResult(resp);
        }

        public Task OnSessionStartedAsync(SessionStartedRequest sessionStartedRequest, Session session, Context context)
        {
            return Task.Delay(0); // nothing to do (yet)
        }

        public Task OnSessionEndedAsync(SessionEndedRequest sessionEndedRequest, Session session, Context context)
        {
            return Task.Delay(0); // nothing to do (yet)
        }

        // I only need to use this when I'm testing locally
//        public override bool OnRequestValidation(SpeechletRequestValidationResult result, DateTime referenceTimeUtc,
//            SpeechletRequestEnvelope requestEnvelope)
//        {
//            return true;
//        }
    }

The OnLaunchAsync is the first thing that an Echo user will reach. The user will say something like "Alexa, open Matt’s booth helper", and this code will respond with some basic instructions.

The OnIntentAsync is where most of the Alexa skills request will be processed. I’m using a factory/strategy code pattern here to instantiate a different object depending on which intent is being invoked (more on "intents" later).

public static IIntentProcessor Create(string intentName = "FallbackIntent")
{
    switch (intentName)
    {
        case "MongodbComparisonIntent":
            return new MongoDbComparisonIntentProcessor(CouchbaseBucket.GetBucket());
        case "WhatIsCouchbaseIntent":
            return new WhatIsCouchbaseIntentProcessor(CouchbaseBucket.GetBucket());
        case "CouchDbIntent":
            return new CouchDbIntentProcessor();
        case "FallbackIntent":
            return new FallbackIntentProcessor();
        default:
            return new FallbackIntentProcessor();
    }
}

Connecting to Couchbase

CouchbaseBucket.GetBucket() is using Lazy behind the scenes as outlined in my earlier blog post on Azure Functions.

So, whenever a 'What is Couchbase' intent comes in, a WhatIsCouchbaseIntentProcessor is instantiated and executed.

public class WhatIsCouchbaseIntentProcessor : BaseIntentProcessor
{
    private readonly IBucket _bucket;

    public WhatIsCouchbaseIntentProcessor(IBucket bucket)
    {
        _bucket = bucket;
    }

    public override async Task<SpeechletResponse> Execute(IntentRequest intentRequest)
    {
        // get random fact from bucket
        var n1ql = @"select m.*, meta(m).id
                        from boothduty m
                        where m.type = 'whatiscouchbase'
                        order by `number`, uuid()
                        limit 1;";
        var query = QueryRequest.Create(n1ql);
        var result = await _bucket.QueryAsync<BoothFact>(query);
        if (result == null || !result.Rows.Any())
            return await CreateErrorResponseAsync();
        var fact = result.First();

        // increment fact count
        await _bucket.MutateIn<dynamic>(fact.Id)
            .Counter("number", 1)
            .ExecuteAsync();

        // return text of fact
        return await CreatePlainTextSpeechletReponseAsync(fact.Text);
    }
}

Note the use of the N1QL query that was mentioned earlier (slightly tweaked so that facts with lower numbers will be given priority). This code is also using the Couchbase subdocument API to increment the "number" field by 1.

You can view the full code of the other intent processors on Github, but they are very similar (just with slightly different N1QL).

Connecting to Azure Functions

Finally, once my speechlet is ready, it’s easy to wire up to an Azure Function.

public static class BoothDuty
{
    [FunctionName("BoothDuty")]
    public static async Task<HttpResponseMessage> Run([HttpTrigger(AuthorizationLevel.Anonymous, "get", "post", Route = null)]HttpRequestMessage req, TraceWriter log)
    {
        var speechlet = new BoothDutySpeechlet();
        return await speechlet.GetResponseAsync(req);
    }
}

You can now test this locally with Postman, or with the Alexa interface once you deploy to azure.

Creating the Alexa skills

I won’t go through the whole process, since there’s plenty of documentation on how to setup Alexa skills. I think I have more work to do before my skill is officially certified, but it’s good enough for beta testing.

Once you have the Azure Functions URL, you’ll use that with Alexa. Alexa requires skills to use HTTPS, but fortunately Azure Functions come with HTTPS on a azurewebsites.net subdomain. Here’s a screenshot:

Alexa skills HTTPS configuration with Azure Functions

I mentioned "intents" earlier. These are various types of actions that Alexa skills can process, along with their inputs. Think of these like function signatures. Currently, I have designed 3 intents, and I have no parameters on these (yet). So my intent schema is a very simple piece of JSON:

Alexa skills intent schema

For each intent, you can create "utterances" that map to the intents. These are the phrases that an Echo user will speak, and which intent they correspond to.

Alexa skills sample utterances

I’ve tried to think of all the different variations. But if I really wanted this to work more generally, I would setup parameters so that a user could ask the question "What is the difference between Couchbase and {x}".

Echo Dot in action

I did not publish this on the Alexa store. I did deploy it as a "beta test", so if you want to try it out, I’d be happy to send you an invitation to get it.

Here’s a video of my trying it out on my Echo Dot (which was a speaker gift last year from the fine people at DevNexus):

Will this actually work at a noisy booth? Well, let’s just say I’m not ready to bring an easy chair and pillow to the booth just yet. But it’s a fun way to demonstrate the power of Couchbase as an engagement database.

Summary

Alexa skills are a great place to use serverless architecture like Azure Functions. The skills will be used intermittently, and Azure Functions will only bill you for the time they are executed.

Couchbase Server again makes a great database for such an app. It can start out small to handle a single booth, but it can scale easily to accommodate larger demand.

Have a question about Couchbase? Visit the Couchbase forums.

Have a question for me? Find me on Twitter @mgroves.

Be sure to check out all the great documentation from Microsoft on Azure Functions, and the documentation on the Alexa Skills .NET library.

This is a repost that originally appeared on the Couchbase Blog: Proof of Concept: Making a case to move from relational.

Proof of concept may be just what you need to start when you’re evaluating Couchbase.

We’ve been blogging a lot about the technical side of moving from a relational database like Oracle or SQL Server to Couchbase. Here are some of the resources and posts we’ve published:

But for this post, we’re going to talk more about the overall process instead of the technical details. You’ll see five steps to creating a successful proof of concept. And if you ever need help getting started, you can talk to a Couchbase Solutions Engineer.

Proof of Concept steps

These steps are not just for migrating an existing application to Couchbase, they also work just as well for creating a brand new "greenfield" application with Couchbase, or even augmenting an existing database (as opposed to replacing it completely).

When creating a Proof of Concept, it’s a good idea to keep the scope as small and simple as possible. Some questions to ask:

  • Will it prove/disprove what you need it to, and help you move to the next step?

  • Can this be accomplished fairly quickly? If it takes too long or isn’t a priority, it might fizzle out.

  • Ask a Couchbase technical team member: is this a good fit for Couchbase? You can draw on their experience to save yourself some heartburn.

Select a use case and application

When I talk to people about Couchbase and NoSQL, I tell them the only thing worse than not using Couchbase is using Couchbase for the wrong thing and becoming soured on document databases.

The benefits of a distributed database like Couchbase are:

  • Better performance

  • Better scalability

  • Higher availability

  • Greater data agility/flexibility

  • Improved operational management

If your application can benefit from one of those characteristics, it’s worth checking out Couchbase. Couchbase may not be the best fit if you need multi-document transactions. But as I showed in my post on data modeling, if you can nest data instead of scattering it in pieces, you may not need multi-document transactions as much as you think.

Further, conversations with Couchbase customers have lead us to identify the need beyond a traditional database to power interactions. Marriott calls this the "look-to-book" ratio.

Think about the interaction to transaction ratio in your proof of concept

If you’re in a situation where you need to record transactions in your traditional database, but you want a low-latency, flexible, scalable database to power all the interactions leading up to it, Couchbase might be the right fit for you.

Some use cases that Couchbase has been a great fit for include:

Define the success criteria

Once you’ve decided that you have a use case that would be good for Couchbase, you need to define what it means for a proof of concept to be successful.

Examples of criteria:

  • Performance/latency improvements - This might boil down to a number, like "5ms latency in the 95th percentile".

  • Ease of scaling - How easy is it to scale now? How much time does it take a person? How many 2am Saturdays do you need to work to do upgrades?

  • Faster development cycles - Does schema management eat up a lot of time in your sprints? A proof of concept with Couchbase can help to demonstrate if a flexible model is going to save you time.

  • Maintenence and costs

Whatever the criteria, it’s good to define it at the beginning, so you can work towards trying to achieve that. A vague goal like "I just want to play around with NoSQL" is fine for an individual developer, but a well-defined success criteria is going to be critical for convincing decision makers.

Understand your data

As I covered in the JSON data modeling post, it’s important for you to understand your data before you even start writing any code. You need to understand what you are going to model and how your application needs to function.

Migrating from a relational to a document database is not going to be a purely mechanical exercise. If you plan to migrate data, it’s better to start by thinking about how it would look independent of how it’s currently stored. Draw out a concept of it on a whiteboard without using "tables" or "documents".

Identify the access patterns

I also covered this in my JSON data modeling post. Couchbase is very flexible in the way that it can store data. But, data access is also flexible. The design of your model should take that into account.

In that blog post, I layed out some rules of thumb for nested/seperate documents. At a higher level, you can start with thinking about data access like this:

  • Key/value - The ability to get/change a document based on its key. This is the fastest, lowest latency method available in Couchbase.

  • N1QL query - N1QL is SQL for JSON data, available in Couchbase. It can query data just about any way you can imagine. Most importantly, you can query data based on something other than its key.

  • Full Text Search - When you need to query based on text in a language aware way. Great for user driven searches, for instance.

  • Map/Reduce - Writing a pure function to calculate query results ahead of time. N1QL is taking a lot of the workload away from M/R, but it’s still good for some specialized types of aggregation.

  • Geospatial - Querying of documents based on some geographical/location based information.

  • Analytics/reporting - Couchbase Analytics (currently in preview) can give you heavily indexed non-operational access to your data. You can run complex reports without impacting day-to-day users.

Review the architecure

At the end of your proof of concept, you can measure your results against the criteria that you created at the very beginning.

It might be a good idea to iterate on this proof on concept: you can apply what you’ve leaned in each subsequent iteration. If you keep the iterations short, you can learn what you’ve applied faster. This isn’t just true of Couchbase, by the way, but anything!

Finally, if your proof of concept is a success (and I know it will be), then it’s time to prepare for production. Take the time to review the architecture, the decisions you’ve made, what worked well, what didn’t work well, and so on. The more you document, the better off the rest of your team and organization will be on the next project.

Summary

Creating a proof of concept with these five steps will help make you successful! All that’s left to do is get started:

This is a repost that originally appeared on the Couchbase Blog: JSON Data Modeling for RDBMS Users.

JSON data modeling is a vital part of using a document database like Couchbase. Beyond understanding the basics of JSON, there are two key approaches to modeling relationships between data that will be covered in this blog post.

The examples in this post will build on the invoices example that I showed in CSV tooling for migrating to Couchbase from Relational.

Imported Data Refresher

In the previous example, I started with two tables from a relational database: Invoices and InvoicesItems. Each invoice item belongs to an invoice, which is done with a foreign key in a relational database.

I did a very straightforward (naive) import of this data into Couchbase. Each row became a document in a "staging" bucket.

Data imported from CSV

Next, we must decide if that JSON data modeling design is appropriate or not (I don’t think it is, as if the bucket being called "staging" didn’t already give that away).

Two Approaches to JSON data modeling of relationships

With a relational database, there is really only one approach: normalize your data. This means separate tables with foreign keys linking the data together.

With a document database, there are two approaches. You can keep the data normalized or you can denormalize data by nesting it into its parent document.

Normalized (separate documents)

An example of the end state of the normalized approach represents a single invoice spread over multiple documents:

key - invoice::1
{ "BillTo": "Lynn Hess", "InvoiceDate": "2018-01-15 00:00:00.000", "InvoiceNum": "ABC123", "ShipTo": "Herman Trisler, 4189 Oak Drive" }

key - invoiceitem::1811cfcc-05b6-4ace-a52a-be3aad24dc52
{ "InvoiceId": "1", "Price": "1000.00", "Product": "Brake Pad", "Quantity": "24" }

key - invoiceitem::29109f4a-761f-49a6-9b0d-f448627d7148
{ "InvoiceId": "1", "Price": "10.00", "Product": "Steering Wheel", "Quantity": "5" }

key - invoiceitem::bf9d3256-9c8a-4378-877d-2a563b163d45
{ "InvoiceId": "1", "Price": "20.00", "Product": "Tire", "Quantity": "2" }

This lines up with the direct CSV import. The InvoiceId field in each invoiceitem document is similar to the idea of a foreign key, but note that Couchbase (and distributed document databases in general) do not enforce this relationship in the same way that relational databases do. This is a trade-off made to satisfy the flexibility, scalability, and performance needs of a distributed system.

Note that in this example, the "child" documents point to the parent via InvoiceId. But it could also be the other way around: the "parent" document could contain an array of the keys of each "child" document.

Denormalized (nested)

The end state of the nested approach would involve just a single document to represent an invoice.

key - invoice::1
{
  "BillTo": "Lynn Hess",
  "InvoiceDate": "2018-01-15 00:00:00.000",
  "InvoiceNum": "ABC123",
  "ShipTo": "Herman Trisler, 4189 Oak Drive",
  "Items": [
    { "Price": "1000.00", "Product": "Brake Pad", "Quantity": "24" },
    { "Price": "10.00", "Product": "Steering Wheel", "Quantity": "5" },
    { "Price": "20.00", "Product": "Tire", "Quantity": "2" }
  ]
}

Note that "InvoiceId" is no longer present in the objects in the Items array. This data is no longer foreign—​it’s now domestic—​so that field is not necessary anymore.

JSON Data Modeling Rules of Thumb

You may already be thinking that the second option is a natural fit in this case. An invoice in this system is a natural aggregate-root. However, it is not always straightforward and obvious when and how to choose between these two approaches in your application.

Here are some rules of thumb for when to choose each model:

Table 1. Modeling Data Cheat Sheet
If …​Then consider…​

Relationship is 1-to-1 or 1-to-many

Nested objects

Relationship is many-to-1 or many-to-many

Separate documents

Data reads are mostly parent fields

Separate document

Data reads are mostly parent + child fields

Nested objects

Data reads are mostly parent or child (not both)

Separate documents

Data writes are mostly parent and child (both)

Nested objects

Modeling example

To explore this deeper, let’s make some assumptions about the invoice system we’re building.

  • A user usually views the entire invoice (including the invoice items)

  • When a user creates an invoice (or makes changes), they are updating both the "root" fields and the "items" together

  • There are some queries (but not many) in the system that only care about the invoice root data and ignore the "items" fields

Then, based on that knowledge, we know that:

  1. The relationship is 1-to-many (a single invoice has many items)

  2. Data reads are mostly parent + child fields together

Therefore, "nested objects" seems like the right design.

Please remember that these are not hard and fast rules that will always apply. They are simply guidelines to help you get started. The only "best practice" is to use your own knowledge and experience.

Transforming staging data with N1QL

Now that we’ve done some JSON Data Modeling exercises, it’s time to transform the data in the staging bucket from separate documents that came directly from the relational database to the nested object design.

There are many approaches to this, but I’m going to keep it very simple and use Couchbase’s powerful N1QL language to run SQL queries on JSON data.

Preparing the data

First, create a "operation" bucket. I’m going to transform data and move it to from the "staging" bucket (containing the direct CSV import) to the "operation" bucket.

Next, I’m going to mark the 'root' documents with a "type" field. This is a way to mark documents as being of a certain type, and will come in handy later.

UPDATE staging
SET type = 'invoice'
WHERE InvoiceNum IS NOT MISSING;

I know that the root documents have a field called "InvoiceNum" and that the items do not have this field. So this is a safe way to differentiate.

Next, I need to modify the items. They previously had a foreign key that was just a number. Now those values should be updated to point to the new document key.

UPDATE staging s
SET s.InvoiceId = 'invoice::' || s.InvoiceId;

This is just prepending "invoice::" to the value. Note that the root documents don’t have an InvoiceId field, so they will be unaffected by this query.

After this, I need to create an index on that field.

Preparing an index

CREATE INDEX ix_invoiceid ON staging(InvoiceId);

This index will be necessary for the transformational join coming up next.

Now, before making this data operational, let’s run a SELECT to get a preview and make sure the data is going to join together how we expect. Use N1QL’s NEST operation:

SELECT i.*, t AS Items
FROM staging AS i
NEST staging AS t ON KEY t.InvoiceId FOR i
WHERE i.type = 'invoice';

The result of this query should be three total root invoice documents.

Results of transformation with N1QL

The invoice items should now be nested into an "Items" array within their parent invoice (I collapsed them in the above screenshot for the sake of brevity).

Moving the data out of staging

Once you’ve verified this looks correct, the data can be moved over to the "operation" bucket using an INSERT command, which will just be a slight variation on the above SELECT command.

INSERT INTO operation (KEY k, VALUE v)
SELECT META(i).id AS k, { i.BillTo, i.InvoiceDate, i.InvoiceNum, "Items": t } AS v
FROM staging i
NEST staging t ON KEY t.InvoiceId FOR i
where i.type = 'invoice';

If you’re new to N1QL, there’s a couple things to point out here:

  • INSERT will always use KEY and VALUE. You don’t list all the fields in this clause, like you would in a relational database.

  • META(i).id is a way of accessing a document’s key

  • The literal JSON syntax being SELECTed AS v is a way to specify which fields you want to move over. Wildcards could be used here.

  • NEST is a type of join that will nest the data into an array instead of at the root level.

  • FOR i specifies the left hand side of the ON KEY join. This syntax is probably the most non-standard portion of N1QL, but the next major release of Couchbase Server will include "ANSI JOIN" functionality that will be a lot more natural to read and write.

After running this query, you should have 3 total documents in your 'operation' bucket representing 3 invoices.

Result from JSON data modeling transformation

You can delete/flush the staging bucket since it now contains stale data. Or you can keep it around for more experimentation.

Summary

Migrating data straight over to Couchbase Server can be as easy as importing via CSV and transforming with a few lines of N1QL. Doing the actual modeling and making decisions requires the most time and thought. Once you decide how to model, N1QL gives you the flexibility to transform from flat, scattered relational data into an aggregate-oriented document model.

More resources:

Feel free to contact me if you have any questions or need help. I’m @mgroves on Twitter. You can also ask questions on the Couchbase Forums. There are N1QL experts there who are very responsive and can help you write the N1QL to accommodate your JSON data modeling.

This is a repost that originally appeared on the Couchbase Blog: CSV tooling for migrating to Couchbase from Relational.

CSV (Comma-seperated values) is a file format that can be exported from a relational database (like Oracle or SQL Server). It can then be imported into Couchbase Server with the cbimport utility.

Note: cbimport comes with Couchbase Enterprise Edition. For Couchbase Community Edition, you can use the more limited cbtransfer tool or go with cbdocloader if JSON is an option.

A straight relational→CSV→Couchbase ETL probably isn’t going to be the complete solution for data migration. In a later post, I’ll write about data modeling decisions that you’ll have to consider. But it’s a starting point: consider this data as "staged".

Note: for this post, I’m using SQL Server and a Couchbase Server cluster, both installed locally. The steps will be similar for SQL Server, Oracle, MySQL, PostgreSQL, etc.

Export to CSV

The first thing you need to do is export to CSV. I have a relational database with two tables: Invoices and InvoiceItems.

Relational tables example

I’m going to export the data from these two tables into two CSV files. With SQL Server Management Studio, this can be done a number of different ways. You can use sqlcmd or bcp at the command line. Or you can use Powershell’s Invoke-Sqlcmd and pipe it through Export-Csv. You can also use the SQL Server Management Studio UI.

Export CSV from SQL Server Management Studio

Other relational databases will have command line utilities, UI tools, etc to export CSV.

Here is an example of a CSV export from a table called "Invoices":

Id,InvoiceNum,InvoiceDate,BillTo,ShipTo
1,ABC123,2018-01-15 00:00:00.000,Lynn Hess,"Herman Trisler, 4189 Oak Drive"
2,XYZ987,2017-06-23 00:00:00.000,Yvonne Pollak,"Clarence Burton, 1470 Cost Avenue"
3,FOO777,2018-01-02 00:00:00.000,Phillip Freeman,"Ronda Snell, 4685 Valley Lane"

Here’s an export from a related table called "InvoiceItems":

InvoiceId,Product,Quantity,Price
1,Tire,2,20.00
1,Steering Wheel,5,10.00
1,Engine Oil,10,15.00
1,Brake Pad,24,1000.00
2,Mouse pad,1,3.99
2,Mouse,1,14.99
2,Computer monitor,1,199.98
3,Cupcake,12,.99
3,Birthday candles,1,.99
3,Delivery,1,30.00

Load CSV into Couchbase

Let’s import these into a Couchbase bucket. I’ll assume you’ve already created an empty bucket named "staging".

First, let’s import invoices.csv.

Loading invoices

C:\Program Files\Couchbase\Server\bin\cbimport csv -c localhost -u Administrator -p password -b staging -d file://invoices.csv --generate-key invoice::%Id%

Note: with Linux/Mac, instead of C:\Program Files\Couchbase\Server\bin, the path will be different.

Let’s break this down:

  • cbimport: This is the command line utility you’re using

  • csv: We’re importing from a CSV file. You can also import from JSON files.

  • -c localhost: The location of your Couchbase Server cluster.

  • -u Administrator -p password: Credentials for your cluster. Hopefully you have more secure credentials than this example!

  • -b staging: The name of the Couchbase bucket you want the data to end up in

  • --generate-key invoice::%Id% The template that will be used to create unique keys in Couchbase. Each line of the CSV will correspond to a single document. Each document needs a unique key. I decided to use the primary key (integer) with a prefix indicating that it’s an invoice document.

The end result of importing a 3 line file is 3 documents:

CSV documents imported into Couchbase

At this point, the staging bucket only contains invoice documents, so you may want to perform transformations now. I may do this in later modeling examples, but for now let’s move on to the next file.

Loading invoice items

C:\Program Files\Couchbase\Server\bin\cbimport csv -c localhost -u Administrator -p password -b staging -d file://invoice_items.csv --generate-key invoiceitem::#UUID#

This is nearly identical to the last import. One difference is that it’s a new file (invoice_items.csv). But the most important difference is --generate—​key. These records only contain foreign keys, but each document in Couchbase must have a unique key. Ultimately, we may decide to embed these records into their parent Invoice documents. But for now I decided to use UUID to generate unique keys for the records.

The end result of importing this 10 line file is 10 more documents:

More CSV documents imported into Couchbase

What’s next?

Once you have a CSV file, it’s very easy to get data into Couchbase. However, this sort of direct translation is often not going to be enough on its own. I’ve explored some aspects of data modeling in a previous blog post on migrating from SQL Server, but I will revisit this Invoices example in a refresher blog post soon.

In the meantime, be sure to check out How Couchbase Beats Oracle for more information on why companies are replacing Oracle for certain use cases. And also take a look at the Moving from Relational to NoSQL: How to Get Started white paper.

If you have any questions or comments, please feel free to leave them here, contact me on Twitter @mgroves, or ask your question in the Couchbase Forums.

Matthew D. Groves

About the Author

Matthew D. Groves lives in Central Ohio. He works remotely, loves to code, and is a Microsoft MVP.

Latest Comments

Twitter