Migrating Big Sites to Craft CMS 3

All the data bits

In the last 9 months or so we have completed at least 3 major migrations to Craft CMS 3. These all happened to be migrations from Craft 2 sites, but the approach outlined below applies just as well to migrations from any other CMS into Craft 3. In subsequent guides we will offer more detail on each of the broad categories covered here. This post is an initial overview.

Aristotle said, "Well begun is half done." This applies of course to web development projects. But there is an important caveat. You can't plan everything down to the last detail. Things change. The details will shift during the project, no matter what. You'll waste time if you obsess about every nuance. But you also can't just throw yourself into a project without some effort to uncover and understand the risks and opportunities that lie ahead.

Planning for a migration project centers mainly on looking for data anomalies. You're searching for the hidden, quirky stuff in data that otherwise looks homogeneous. You will encounter these data problems sometime between project initiation and launch; the earlier you find them before launch, the better.

Take a discovery approach to planning a migration and you will do just fine. That means finding ways to surface the hidden stuff: ways to discover legacy problems with how content areas and subsections were set up years ago. In most cases your web development team should be charging you for the discovery work. It leans heavily on the most experienced and smartest minds on the team. Those experts cost money, and the money is very well spent.
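As a concrete sketch of what discovery can look like in practice, here is a small Python profiling script that scans a legacy JSON export for two of the most common anomalies: fields that only appear on some records, and fields whose value types are inconsistent. The sample records and field names are hypothetical; your export format will differ.

```python
from collections import Counter

def profile_records(records):
    """Summarize field usage across exported records to surface anomalies:
    fields that appear only sometimes, and fields with inconsistent value types."""
    field_counts = Counter()
    type_sets = {}
    for rec in records:
        for field, value in rec.items():
            field_counts[field] += 1
            type_sets.setdefault(field, set()).add(type(value).__name__)
    total = len(records)
    report = {}
    for field, count in field_counts.items():
        report[field] = {
            "coverage": count / total,          # < 1.0 means the field is sometimes missing
            "types": sorted(type_sets[field]),  # more than one type is a red flag
        }
    return report

# Hypothetical sample of two legacy entries with a classic quirk:
# "date" is sometimes a string and sometimes a Unix timestamp.
sample = [
    {"title": "A", "date": "2018-01-01"},
    {"title": "B", "date": 1514764800, "byline": "Jane"},
]
report = profile_records(sample)
```

Run something like this across your full export early; every anomaly it surfaces is one you won't discover the week before launch.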

Set up a dev site with real horsepower
Some sites are big or extremely complex, and sometimes both. There is a lot of work involved in moving them from one place to another, or from one platform version to another. You have to be ready to break things. That's why you need a development environment.

You discovered a lot of the hidden problems in the planning phase above. Now it's time to get things set up so that you can break things in constructive ways, and do it safely. You want to be able to break things fast and fix them fast. This means your dev site needs horsepower.

For years web developers tried to save their clients money by spending as little as possible on web assets that are not client-facing. This meant that data migration scripts would run slowly and sometimes had to be left overnight to do their work. That is the opposite of what you actually want.

In order to keep a migration project on budget, you need to keep it on its timeline. The best way to do that is to make sure that the dev site(s) used by the web teams stay out of the way. You want reliability in your development cycle. That means reducing friction, which means speed and power in your development systems.

Audit your content models
You have years of cruft at the content organization level. What made sense for the organization of data in a section of the site years ago does not make sense today. Your team is more mature. Your publishing process is more mature. You are clearer on what's essential in your content and what's in the way. Audit your content models and use the migration as an opportunity to change them.

Since you are migrating content from one system to another, using the method outlined below, you have an opportunity to rework how you organize your information. You have an opportunity to put into practice what you have learned about the strengths and weaknesses of your content over the years.

Set up outbound JSON feeds
You are decoupling the two systems. The origin system, the one that is live already, will not migrate directly to the new destination system. The main reason for the decoupling is that you are still publishing heavily every day. You can't turn off the tap; if you stop the flow of content, the flow of your audience is interrupted too. Since you have to continue publishing while also wrangling a complex system, you need to decouple the two to free yourself up to get work done.

Your first step in the decoupling is to set up outbound feeds from your origin system. The best format for feeds like this is JSON (JavaScript Object Notation), a data format native to JavaScript and ubiquitous across the web. Most modern APIs use JSON as their format for exchanging data. You should too.

So set up your JSON feeds. In some cases you can use behavior built into your CMS for this, and there may be plugins you can use to quickly spin up JSON feeds. But in all likelihood you will want to hand-code these feeds. Why? Because this creates an opportunity for you to curate and shape your data before it is consumed by the destination system. Remember the step above where you gave yourself permission to refactor your data model? These JSON feeds are where you will represent some aspects of that new data model. It's part of where you translate your content from one place to another.

Set up your Craft data model
Craft 3 is awesome. It is so flexible in how it handles your content. Setting up the data model is your golden opportunity to get things to be exactly the way you want them. You don't have to fully predict the future. You can change things later. But now is your chance to get rid of a lot of the organizational problems and legacy issues from your old site.

In Craft you will make choices about what kinds of content belong in channels, what will be structures and what will be singles.

Imagine you have a working group within your organization that offers a range of services. They have a main landing page that shows a brief for each service area, and users click through to the detail landing page for each one. The main landing page is a single in Craft: a single section with a single entry, with its own custom set of fields organized into tabs. Those fields capture the summary data for each service area, including the links to the service area landing pages or separate microsites that provide more detail.

Then you have a separate Craft section for the on-site landing pages. Each service area gets its own entry, but all landing pages in this section share the same field and tab set. This is just one of thousands of examples of content organization in Craft. The key is anticipating your needs and creating future flexibility for yourself.

Configure FeedMe
You created a set of JSON feeds in a previous step. There you prepared feeds of data from your legacy system, the one we call the origin. Now you configure the harvesting tool: FeedMe.

FeedMe is a first-party plugin for Craft. Its job is to harvest content from external feeds, and its favorite feed format is JSON. With FeedMe you create multiple incoming feeds, each mapped to one of your origin feeds. FeedMe can map content to any type of object in Craft.

FeedMe can update entries that were previously imported. This is key. You are going to rerun these feeds over and over again, and each time you run them you check the data and fix the issues you find. This is the time-consuming, iterative part. You will run the feeds repeatedly, tuning the import each time, until the content in the new Craft site reaches about 90% perfection. Humans take over at this point.

Set up inbound JSON verification feeds
In the next step you will validate your data. If you are dealing with a large volume of content, you need a way to handle it at the detail level without causing the humans who manage it to go blind or cry too much. We'll look more closely at that below. For now, it is very helpful to have mirror JSON feeds on the destination site. These feeds are a little artificial in how they present your new data model: their job is to mirror the origin feeds, flattening your new structure back into the old shape. This facilitates the kind of comparison work you will be doing at scale.
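One way to make the two sides comparable is to normalize items from both feeds into a single canonical shape before diffing them. A minimal Python sketch, with hypothetical keys:

```python
def normalize(item):
    """Reduce a feed item from either side (origin or destination) to the
    canonical shape used for comparison. Keys here are hypothetical."""
    return {
        "externalId": item["externalId"],
        "title": item["title"].strip(),
        # Compare a cheap fingerprint of the body rather than raw HTML,
        # which the two systems may serialize slightly differently.
        "wordCount": len(item.get("body", "").split()),
    }

origin_item = {"externalId": 7, "title": "Hello ", "body": "<p>one two</p>"}
dest_item = {"externalId": 7, "title": "Hello", "body": "<p>one  two</p>"}
same = normalize(origin_item) == normalize(dest_item)
```

The normalization step is where you decide which differences matter (missing content, broken relationships) and which are noise (whitespace, markup serialization).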

Prepare validation architecture
Most of the migrations into Craft will be from large sites. This means there will be thousands of entries and many thousands of data nodes. All of this must be checked and rechecked to make sure that everything was migrated over correctly. From text, to image and media assets, to authors and content relationships, everything must be validated.

When you're validating data of this type and volume, it is not realistic to have a human or a team of humans do all the work. That's error prone and expensive. It also does not make sense to have a machine do the validation alone; the machine would have to be taught in detail what to look for, which is just as prohibitive. The sweet spot is a cyborg: a system, a method, that allows a small team of people to work in partnership with a machine to validate the data and make manual corrections.

We completed a large Craft 3 migration with ProPublica recently and built a validation tool to help our data management team validate and correct migrated content. Here's a video of that tool in action so you can see how we did it. The main idea is that you already have an outbound JSON feed from your origin site and a mirror feed on the destination site. You run a scripting tool that lets humans compare the two feeds. They should be mirror images of each other, and where they are not, that's where you get to work and correct your data.
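The comparison at the heart of a tool like this can be quite simple: diff the two feeds by a shared key and report what's missing and what doesn't match. This is not the ProPublica tool, just a minimal Python sketch of the idea, with a hypothetical `externalId` key:

```python
def compare_feeds(origin_items, dest_items, key="externalId"):
    """Compare mirror JSON feeds. Returns ids missing from the destination,
    ids that only exist on the destination, and per-field mismatches for
    items present on both sides."""
    origin = {item[key]: item for item in origin_items}
    dest = {item[key]: item for item in dest_items}
    missing = sorted(set(origin) - set(dest))
    extra = sorted(set(dest) - set(origin))
    mismatches = {}
    for k in set(origin) & set(dest):
        diffs = {field: (origin[k][field], dest[k].get(field))
                 for field in origin[k]
                 if origin[k][field] != dest[k].get(field)}
        if diffs:
            mismatches[k] = diffs
    return {"missing": missing, "extra": extra, "mismatches": mismatches}

origin = [{"externalId": 1, "title": "A"}, {"externalId": 2, "title": "B"}]
dest = [{"externalId": 1, "title": "A!"}]
result = compare_feeds(origin, dest)
```

The machine produces the worklist; the humans work through it. That division of labor is the cyborg sweet spot described above.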

Run FeedMe, rinse and repeat
As we saw above, you will run your content migrations over and over again. You'll refine them each time. And since FeedMe inside Craft 3 is smart enough to update records you previously imported, you shouldn't have massive duplication problems.

Big data migrations are scary. It helps a great deal to know that you will be running the migration over and over again in a safe environment, refining it every time until your content is finally nice and tidy and ready for launch. Sometimes psychological factors like this one are a big part of the Web Reliability matrix we propound at Solspace.

Switch to manual mode to deal with remaining data outliers
You have your feeds. You have harvested them into Craft 3 using FeedMe. You have done so over and over again until your content integrity is at about 90%. That's close enough. You have now hit the point of diminishing returns as far as getting computers to do smart stuff for you. Now it's time for your data people to take over and push everything over the finish line.

At Solspace every one of our large builds includes our chief content management guardian, Missy. Missy's expertise is in content management. By this I mean that she knows how to quickly get familiar with a client's data model. She knows where to look for problems based on her many years of experience and she knows how to execute quickly. On a large data project you need someone with the skills and expertise of a content guardian to make sure that your data integrity is maintained and your site is launchable.

Double post leading up to launch
Though you may try to avoid it, as you prepare your launch sequence and start flipping the switches, you likely need to double publish. Your content producers will publish in the origin system as well as the destination system at the same time. This is usually unavoidable for a couple of days during the launch sequence. All of the previous steps you undertook help avoid a long and cumbersome double publish cycle, but be ready for it.

Once you and your dev ops team are ready, you can finish off your last launch sequence steps. Congratulations! This took weeks or even months, but you are finally live with your new Craft CMS 3 website. (Don't forget to pull down your validation feeds. They are no longer needed.)