Vim Search and Replace: Grabbing Image URLs from HTML source code

The following is a quick and dirty way of pulling a lot of URLs out of a given page’s source code, using two commands in vim, my new favourite text editor. So, right to the point!

:v/jpg/d
:%s/^.*src="\(http:.\{-}jpg\)".*/\1/g

Try it out right now on the source code of Imgur’s /r/ScarlettJohansson page.

I’m not well versed enough in vim to go into an in-depth explanation of why the above commands work beyond what they actually do, so forgive any lack of detail.

:v/jpg/d

This calls the v command, which is the opposite of the g command. What v does is go through all the lines in the current file or buffer and mark every line that doesn’t contain the string jpg; the trailing d then deletes the marked lines. For this example it works really well, but there are times when it isn’t nearly exact enough: a line might just have a phrase with jpg in it, sort of like this one does, and it won’t get deleted. So that’s something to keep in mind. For now though, it’s just what we need.
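If you’d like to see the filtering step outside of vim, here is a minimal Python sketch of the same idea. It is not what vim runs internally, just a rough equivalent assuming vim’s default case-sensitive matching, and the sample lines are made up for illustration.

```python
# Rough Python equivalent of :v/jpg/d: keep only the lines containing "jpg".
# The sample lines below are invented for illustration.
lines = [
    '<div class="post">',
    '<img src="http://i.imgur.com/abc12b.jpg" alt="">',
    '<p>All images are in jpg format.</p>',
    '<a href="/about">About</a>',
]

# A plain substring test, like the bare pattern jpg in the vim command.
kept = [line for line in lines if "jpg" in line]

for line in kept:
    print(line)
```

Note that the paragraph line survives even though it holds no image URL, which is exactly the false positive mentioned above.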

So now, the code should look something like this, with all the lines that didn’t contain jpg removed, that is to say, all the lines that don’t contain a URL pointing to an image. The next step is a bit more complicated, in that we use several of regex’s metacharacters.

:%s/^.*src="\(http:.\{-}jpg\)".*/\1/g
looks nasty, but if you break it down it’s a bit easier:
^.*src="
matches everything from the start of the line up to src=" and then I match
\(http:.\{-}jpg\)
as a group. What that does is find as little text as it can between http: and jpg and save it to a group, by surrounding it with the escaped parentheses \( and \), which’ll allow the grouped text to be reused in the replacement later on.
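The “as little text as it can” part is easier to see side by side with the greedy version. Here is a small Python sketch, assuming that Python’s .*? behaves like vim’s .\{-} for this purpose; the attribute text and the example.com URLs are hypothetical.

```python
import re

# Hypothetical attribute text with two quoted jpg URLs on one line.
text = 'src="http://example.com/a.jpg" data-full="http://example.com/b.jpg"'

# Python's .*? stands in for vim's .\{-}: as few characters as possible.
lazy = re.search(r'http:.*?jpg', text).group(0)

# Python's .* stands in for vim's .*: as many characters as possible.
greedy = re.search(r'http:.*jpg', text).group(0)

print(lazy)    # http://example.com/a.jpg
print(greedy)  # http://example.com/a.jpg" data-full="http://example.com/b.jpg
```

The lazy match stops at the first jpg it can reach, while the greedy one runs on to the last jpg on the line and drags the attribute text along with it.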

Lastly, the tail part of the command:
".*/\1/g
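matches a double quote, ", and the rest of the line with .* which means find as many characters as you can until you can’t (the end of the line, in our case). The whole match, which is the entire line, is then replaced with \1, where \1 stands for the first group that we matched. In our case, \1 stands for the URL we found. Because the pattern is anchored with ^ and swallows the whole line, it can only match once per line, so the trailing g has no effect here and could be dropped.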
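To check the whole substitution away from vim, here is a hedged Python translation of it. Vim’s regex syntax isn’t Python’s, so I’m assuming that \( \) maps to ( ) and .\{-} maps to .*?; the helper name extract_url and the sample line are my own inventions, not anything from vim or Imgur.

```python
import re

# Python translation of the vim pattern: \( \) becomes ( ), .\{-} becomes .*?
PATTERN = re.compile(r'^.*src="(http:.*?jpg)".*')

def extract_url(line):
    """Replace a whole matching line with its captured URL; leave others alone."""
    return PATTERN.sub(r'\1', line, count=1)

# Made-up sample line; the URL is a placeholder, not a real image.
line = '<a href="/x"><img class="thumb" src="http://i.imgur.com/abc12b.jpg" alt=""></a>'
print(extract_url(line))
```

One thing this sketch makes visible: because the leading .* is greedy, a line holding two src="http:...jpg" attributes comes out as only the last URL in this Python version, so it works best when each image sits on its own line.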

This is all you need, or at least all you’ll need, to pull the image URLs out of an HTML source. Using our example from earlier, we’ll arrive at something like this, which is just each URL on its own line. You could call it a day.

But we’re not going to, since we’re familiar with the Imgur API, and we know that Imgur appends a size suffix to its image names, either s, l or b among others, and looking at our images, it’s clear that we only have the smaller versions.

So what we need to do is replace all instances of b.jpg with .jpg. That’ll remove the b and allow us to see the full sized image. The command to do this is the following:

:%s/b\.jpg/.jpg

Here we use :%s to run a substitution on every line in the file or buffer that matches the b\.jpg pattern. We have to escape the period because vim otherwise treats it as a wildcard character. Finally, we replace the matched text with .jpg, and we’re left with the resulting set of URLs.
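The same clean-up can be sketched in Python, again as a rough equivalent and not as what vim does under the hood. The URLs are made-up placeholders in the shape described above, with the size letter sitting right before .jpg.

```python
import re

# Made-up thumbnail URLs with the b size letter before .jpg.
urls = [
    "http://i.imgur.com/abc12b.jpg",
    "http://i.imgur.com/xyz34b.jpg",
]

# count=1 mirrors the missing g flag: only the first match per line is replaced.
# The backslash keeps the dot literal, just as in the vim command.
full_size = [re.sub(r'b\.jpg', '.jpg', url, count=1) for url in urls]

for url in full_size:
    print(url)
```

Without the backslash, the dot would match any character, so a name like abjpg would match as well.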

Not too shabby, eh, considering it’s only 3 relatively simple lines in vim. You could always turn this into a macro so you can easily take any source page and pull out the jpgs, but that’s for another day.
