How to Find All Existing and Archived URLs on a Website

There are several reasons you might need to find every URL on a website, and your exact purpose will shape what you're looking for. For instance, you may want to:

Identify every indexed URL to investigate issues like cannibalization or index bloat
Gather current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each case, no single tool will give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data in a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which can be insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To get around the lack of an export button, use a browser scraping plugin like Dataminer.io. Even so, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
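If you're comfortable with a little Python, the Wayback Machine's CDX API is another way to pull archived URLs without scraping the interface. Below is a minimal sketch; "example.com" is a placeholder domain, and the parameter choices (domain match, 200-only filter) are assumptions you'd adjust for your own site.

```python
import requests

# Query the Wayback Machine CDX API for archived URLs on a placeholder domain
params = {
    "url": "example.com",
    "matchType": "domain",       # include subdomains
    "fl": "original",            # return only the original URL field
    "collapse": "urlkey",        # collapse repeated captures of the same URL
    "filter": "statuscode:200",  # keep successful captures only (an assumption; drop to see everything)
    "output": "text",
}
resp = requests.get("http://web.archive.org/cdx/search/cdx", params=params, timeout=60)
resp.raise_for_status()

urls = resp.text.splitlines()
with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Retrieved {len(urls)} archived URLs")
```

Results from the CDX API share the same quality caveats as the UI: expect resource files and malformed paths mixed in with real pages.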

Moz Pro
While you might typically use a backlink index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
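As a rough illustration of the API route, the Search Analytics query endpoint returns up to 25,000 rows per request and can be paged. The sketch below assumes you already have authorized OAuth credentials in a `creds` object and uses a placeholder property URL.

```python
from googleapiclient.discovery import build

# Assumes `creds` holds authorized OAuth credentials for the Search Console API
service = build("searchconsole", "v1", credentials=creds)

all_pages, start_row = [], 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,           # maximum rows per request
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    all_pages.extend(row["keys"][0] for row in rows)
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"{len(all_pages)} pages with search impressions")
```

Keep in mind this only surfaces pages that earned impressions in the chosen date range, so it complements rather than replaces the other sources.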

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still offer valuable insights.
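The same segmented pull can be done programmatically with the GA4 Data API, which avoids the UI export cap entirely. This is a minimal sketch, assuming the google-analytics-data client library, a placeholder property ID, and a hypothetical /blog/ path filter.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # assumes application default credentials are configured

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    # Narrow the pull to blog URLs, mirroring the segment built in the UI steps above
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(blog_paths)} blog paths with recorded views")
```

Run one filtered report per section of the site and you can stitch the results together later without bumping into the 100k ceiling.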

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but plenty of tools are available to simplify the process; a bare-bones sketch follows this list.
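If you just need the unique URL paths rather than a full log analysis, a few lines of Python will do. This sketch assumes a common/combined-format access log named "access.log"; adjust the filename and regex for your server's format.

```python
import re
from urllib.parse import urlsplit

# Matches the request portion of a common/combined log line, e.g. "GET /path HTTP/1.1"
request_re = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = request_re.search(line)
        if match:
            # Drop query strings so /page?utm=... and /page collapse to one path
            paths.add(urlsplit(match.group(1)).path)

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"{len(paths)} unique URL paths found in the log")
```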
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
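For the Jupyter Notebook route, pandas makes the combine-normalize-deduplicate step short. The file names and the single "url" column below are assumptions standing in for whatever exports you actually saved; the normalization rules are deliberately simple and worth tailoring to your site.

```python
import pandas as pd

# Hypothetical exports gathered from the sources above, each with a 'url' column
files = ["archive_org_urls.csv", "moz_links.csv", "gsc_pages.csv", "ga4_pages.csv", "log_paths.csv"]

frames = [pd.read_csv(f, usecols=["url"]) for f in files]
urls = pd.concat(frames, ignore_index=True)["url"].astype(str)

# Normalize formatting before deduplicating: trim whitespace, unify the protocol, drop trailing slashes
urls = (urls.str.strip()
             .str.replace(r"^http://", "https://", regex=True)
             .str.rstrip("/"))

unique_urls = urls.drop_duplicates().sort_values()
unique_urls.to_csv("all_urls_deduplicated.csv", index=False)
print(f"{len(unique_urls)} unique URLs")
```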

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!

