Testing a WordPress URL problem

While monitoring the logs for this web site, I’ve noticed a lot of weird URLs with invalid parameters like ‘loginid’ and ‘commentid’. At first I paid no attention to them, because WordPress doesn’t recognize those parameters and essentially ignores them.

But the volume of these strange requests grew to the point where I started to wonder what was going on. It turns out that although WordPress ignores invalid URL parameters, it also – in some cases – returns those invalid parameters in page content. If you go to the home page of boot13.com, and add ‘/?blahblah’ to the end of the URL, then hover your mouse over the ‘Older posts’ link at the bottom of the resulting page, it will show ‘/?blahblah’.
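The check described above can also be done without a mouse. Here is a minimal Python sketch, assuming the page HTML has already been downloaded: it collects every link in the page and reports the ones that contain the test parameter. The sample HTML and the names `LinkCollector` and `echoed_links` are made up for illustration; the real markup on any given WordPress site will differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def echoed_links(html, marker):
    """Return the links in `html` whose href contains `marker`."""
    collector = LinkCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if marker in href]


# Hypothetical page fragment, standing in for the response to '/?blahblah'.
sample = '<a href="/page/2/?blahblah">Older posts</a> <a href="/about/">About</a>'
print(echoed_links(sample, "blahblah"))
```

If the list comes back non-empty, the site is echoing the arbitrary parameter into its own links.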

The fact that WordPress echoes arbitrary parameters isn’t, in itself, a huge problem. Most web crawlers, including Googlebot, are smart enough to recognize that the spurious parameters don’t correspond to unique pages on the site, and ignore them automatically. But some crawlers, in particular Bing’s crawler and the MJ12bot crawler, treat every URL that includes an arbitrary parameter as a unique URL, and index it accordingly.
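To make the idea of "recognizing spurious parameters" concrete, here is a rough Python sketch of one way a crawler or log-analysis script could collapse these URLs: keep only the query parameters it knows about and drop the rest. I don't know how any real crawler actually does this, and the `KNOWN_PARAMS` allow-list below is only an illustrative handful of WordPress query variables, not a complete list.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative allow-list (an assumption, not exhaustive).
KNOWN_PARAMS = {"p", "page_id", "s", "paged", "cat", "tag"}


def canonical_url(url):
    """Drop every query parameter that is not in KNOWN_PARAMS."""
    parts = urlsplit(url)
    # keep_blank_values=True so a bare '?blahblah' is parsed (and then dropped).
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in KNOWN_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/?loginid=42"))
print(canonical_url("https://example.com/?p=123&commentid=9"))
```

With this approach, any number of bogus variations of a URL reduce to the same address, so they count as one page rather than many.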

This produces a lot of clutter in Bing’s search results for boot13, and the information provided by Bing Webmaster Tools is filled with these bogus URLs. And that’s annoying.

I’ve taken several steps to try to reduce this clutter. I used robots.txt to tell crawlers to ignore any URL with ‘loginid’ or ‘commentid’. Using Bing Webmaster Tools, I told bingbot to ignore those parameters. As a result, Bing’s search results and site data are looking a lot better. But while most crawlers honour robots.txt, some don’t. In particular, some MJ12bot nodes clearly ignore robots.txt. These may be rogue MJ12bot nodes, or those nodes may be misconfigured in some way.
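For anyone wanting to try the robots.txt approach, here is a small Python sketch for testing wildcard rules against sample request paths before deploying them. The two patterns in `RULES` are my guess at what such rules could look like (for example, a line like `Disallow: /*loginid`), not a copy of the actual robots.txt for this site. The matching is written by hand and only covers the `*` wildcard and the trailing `$` anchor, so treat it as an approximation of how a well-behaved crawler interprets the rules.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional '$' end anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """True if `path` matches any Disallow pattern (matched from the start of the path)."""
    return any(rule_to_regex(p).match(path) for p in disallow_patterns)


# Hypothetical Disallow patterns for the two bogus parameters.
RULES = ["/*loginid", "/*commentid"]

print(is_blocked("/?loginid=123", RULES))
print(is_blocked("/2015/01/some-post/", RULES))
```

One caveat with patterns this broad: they block any path that contains the parameter name anywhere, including a legitimate post whose slug happens to include ‘loginid’. And of course none of this helps with crawlers that ignore robots.txt altogether.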

Now I’m trying to determine just how much of a problem this really is. I decided to see if I could introduce some arbitrary text into the search results and related data for another WordPress site (one not owned or managed by me).

Here’s a link to the UPS blog. That site runs on WordPress, and it exhibits the same behaviour I’ve been seeing on boot13. The URL in the first sentence of this paragraph contains a special, unique parameter. The idea is to see what happens when the URL is crawled by Bingbot. Will my special parameter show up in the search results for the UPS blog? I’ll update this post as I learn more.

Update 2015Jan30: The parameter is now appearing in Google site search results for the UPS blog! As I write this, there are at least 79 entries, most of which are actually duplicates. Still nothing in Bing’s search results.

Update 2015Jan31: I checked the WordPress bug tracking system to see if anyone had reported this previously. They had. I ended up re-opening an existing ticket and adding my observations. Hopefully this will lead to a fix!

About jrivett

Jeff Rivett has worked with and written about computers since the early 1980s. His first computer was an Apple II+, built by his father and heavily customized. Jeff's writing appeared in Computist Magazine in the 1980s, and he created and sold a game utility (Ultimaker 2, reviewed in the December 1983 Washington Apple Pi Journal) to international markets during the same period. Proceeds from writing, software sales, and contract programming gigs paid his way through university, earning him a Bachelor of Science (Computer Science) degree at UWO. Jeff went on to work as a programmer, sysadmin, and manager in various industries. There's more on the About page, and on the Jeff Rivett Consulting site.
