September 26, 2014

How to build hreflang XML Sitemaps with Excel

Building hreflang sitemaps with Excel

When someone is internationalizing his organic acquisition efforts, hreflang sitemaps are a great way to indicate to Google the various available languages of a website and spread the SEO visibility across the different markets.

Google also introduced “x-default hreflang” tag back in April 2013 in order to help brands direct visitors to the localized version of their homepage as a step of improving user experience. The “x-default hreflang” tag facilitates search engines to understand a Homepage that isn’t targetting a specific region but has a more generic (global targetting). The tag is also supported by Yandex

How hreflang tag works

The hreflang tag must be used by websites that target various geographies and languages and can be implemented directly on sitemaps (follow the post for instructions) when the on-page implementation is difficult. By using the hreflang tag, the webmaster can show Google the proper landing page for each google searcher.

To understand better how hreflang tag works, let’s work with an example. As an example, I’ll use the world’s leading hotel website, Booking.com

Booking.com has already an on-page implementation of hreflang tag, as we can notice on their source code. It looks like this on the homepage:

<link rel="alternate" type="text/html" hreflang="en-gb" href="http://www.booking.com/index.en-gb.html" title="English (UK)" />
<link rel="alternate" type="text/html" hreflang="en-us" href="http://www.booking.com/" title="English (US)" />
<link rel="alternate" type="text/html" hreflang="de" href="http://www.booking.com/index.de.html" title="Deutsch" />
<link rel="alternate" type="text/html" hreflang="nl" href="http://www.booking.com/index.nl.html" title="Nederlands" />
<link rel="alternate" type="text/html" hreflang="fr" href="http://www.booking.com/index.fr.html" title="Français" />
<link rel="alternate" type="text/html" hreflang="es" href="http://www.booking.com/index.es.html" title="Español" />
<link rel="alternate" type="text/html" hreflang="ca" href="http://www.booking.com/index.ca.html" title="Català" />
<link rel="alternate" type="text/html" hreflang="it" href="http://www.booking.com/index.it.html" title="Italiano" />
<link rel="alternate" type="text/html" hreflang="pt-pt" href="http://www.booking.com/index.pt-pt.html" title="Português (PT)" />
<link rel="alternate" type="text/html" hreflang="pt-br" href="http://www.booking.com/index.pt-br.html" title="Português (BR)" />
<link rel="alternate" type="text/html" hreflang="no" href="http://www.booking.com/index.no.html" title="Norsk" />
<link rel="alternate" type="text/html" hreflang="fi" href="http://www.booking.com/index.fi.html" title="Suomi" />
<link rel="alternate" type="text/html" hreflang="sv" href="http://www.booking.com/index.sv.html" title="Svenska" />
<link rel="alternate" type="text/html" hreflang="da" href="http://www.booking.com/index.da.html" title="Dansk" />
<link rel="alternate" type="text/html" hreflang="cs" href="http://www.booking.com/index.cs.html" title="Čeština" />
.
.
.
.
.
<link rel="alternate" type="text/html" hreflang="sl" href="http://www.booking.com/index.sl.html" title="Slovenščina" />
<link rel="alternate" type="text/html" hreflang="vi" href="http://www.booking.com/index.vi.html" title="Tiếng Việt" />
<link rel="alternate" type="text/html" hreflang="tl" href="http://www.booking.com/index.tl.html" title="Filipino" />
<link rel="alternate" type="text/html" hreflang="is" href="http://www.booking.com/index.is.html" title="Íslenska" />

So as we can see on the highlighted code above, Booking.com homepage indicates to search engines different versions of this specific page according to countries and / or languages. In Booking.com’s strategy is to point the root domain to the US market and not have a broad targeting strategy by using the “X-default” hreflang tag.

But how the code above would look like into an XML Sitemap? For the purposes of demonstration, I’ll limit the URLs to only 4:

http://www.booking.com/
http://www.booking.com/index.en-gb.html
http://www.booking.com/index.de.html
http://www.booking.com/index.nl.html

The hreflang sitemap for the Homepage of Booking.com with 3 different targeting options should look like this:

http://www.booking.com/hotel/de/h2-berlin-alexanderplatz.en-gb.html

[http://www.booking.com/hotel] [/de] [h2-berlin-alexanderplatz. [en-gb] [.html]
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.booking.com</loc>
    <xhtml:link 
                 rel="alternate"
                 hreflang="en-US"
                 href="http://www.booking.com"
                 />
    <xhtml:link 
                 rel="alternate"
                 hreflang="en-GB"
                 href="http://www.booking.com/index.en-gb.html"
                 />
    <xhtml:link 
                 rel="alternate"
                 hreflang="de"
                 href="http://www.booking.com/index.de.html"
                 />
     <xhtml:link 
                 rel="alternate"
                 hreflang="nl"
                 href="http://www.booking.com/index.nl.html"
                 />            
  </url>
</urlset>

The above example seems quite easy and simple, although how do we scale this process in order to build page-by-page sitemaps with hreflang tags for huge websites?

Building hreflang sitemaps with Excel

The following process is an easy and quick way to generate your hreflang XML Sitemaps with the help of Microsoft Excel plus some other tools.

Step 1 – Gather all the URLs

First of all, you need to have a list of all the live URLs of your website that are crawlable from search engines. Personally, I’m a big fan of Screaming Frog crawling software. Screaming Frog allows you to crawl your website by selecting as user agent “Googlebot” and by respecting your current robots.txt file. The crawler can easily handle up to 100.000 URLs (deactivate the crawling of images, JS, CSS and SWF files to facilitate crawling. If you have the RAM bandwidth, you can increase the memory allocation according to your system setup – personally I’ve managed to crawl up to 400.000 URLs by allocating 8GB of ram to the software.

There are many other cloud services in the market that can help you crawl fully your websites, especially if they ‘re bigger than the capabilities of Screaming Frog. A few that I’ve used and captured my interest are DeepCrawl, Botify and 80legs.

If you ‘re a more advanced and techy user, you can always use some custom web crawler libraries like Scrapy – but if you ‘re that nerdy, I don’t think you need to keep reading this post.

So in my case, with the usage of Screaming Frog for crawling, I did the following setup:

Setting up the spider settings

Screaming_Frog_1

Setting up regular expressions to limit the results by language

On Configuration -> Include menu of Screaming Frog, you can limit the crawling according to URL patterns. Booking.com URLs end up with a common pattern for each language / translation. For German language for example, the URLs include the pattern “de.html” – so the regular expression on Screaming Frog should look like:

Screaming_Frog_2

Step 2 – Organize your data and match the URLs

Once you are done with your crawling, it’s time to organize your data into your spreadsheet and match the URLs, according to the language / region each URL is referring too.

The recommended setup is one column per country / region, as you can see below:

Crawling_xlsx

Now that you have your crawling data organized into columns, it’s time to match the URLs. If your URL format is like booking.com, it’s quite to easy to match the URLs, since the URLs are identical among the languages with the usage of filters and / or sorting of the cells.

Alternatively, if you want to match and sort the URL strings you can use the Like method in Excel when you need to use regular expressions to match strings in different columns or the MATCH function if you just need to sort your data.

Creating the hreflang sitemap on Excel

Once you have your data sorted into columns with equivalent URLs aligned in the same line, now it’s time to use excel in order to build the sitemap.

Step 1 – Add additional columns on the first row

On our example, we will place the default URLs (those for global targetting) on Column A. Now we have already placed data on Column A, B, C and D. We need to add the following elements on the first line of the following columns:

  • Column E: <url>
  • Column F: <loc>
  • Column G: </loc>
  • Column H: <xhtml:link rel=”alternate”
  • Column I: hreflang=”de”
  • Column J: hreflang=”nl”
  • Column K: hreflang=”en-GB”
  • Column L: href=”
  • Column M:  ” />
  • Column N: </url>

So the spread sheet should like like this:

spreadsheet-hreflang-sitemap

On Column O, we can start concatinating the cells in order to produce the core of the hreflang XML sitemap. The function that will form the hreflang sitemap lines on Column O, should be:
[highlight]=E1&F1&A2&G1&H1&” “&I1&” “&B2&M1&H1&” “&J1&” “&C2&M1&H1&” “&K1&” “&D2&M1&N1[/highlight]

..and just drag the function until the end of the lines, according to your filesize.

Then simply copy the Column O, paste it in a new spread sheet (by keeping the text values only) and save the file as .xml.

Then open the .xml file with a text editor and add the remaining values for the sitemap at the beginning and the end of the code.

Validate an hreflang XML sitemap with Excel

Once you ‘re done with the creation of XML hreflang sitemap, Excel can help you again to validate the sitemap and check it for errors.

In order to do the health check, you have to upload the XML file on your website. Then on Excel, follow the next steps:

  1. Go to Data tab
  2. Select Other Sources -> From XML Data Import
  3. In the pop-up window, paste the URL where your hreflang sitemap is located and click “Yes” on the next dialogue
  4. Your sitemap should appear on your spread sheet

That’s more or less the process, feel free to drop a comment below if you have any questions!

Leave a Reply

Your email address will not be published. Required fields are marked *