My latest coding project has been taking up all of my free time lately. Honestly though, that’s not a whole lot of time. I’m typically super busy and my free time consists of
- my morning light rail commute to downtown (about 45 minutes)
- my evening light rail commute home (about 45 minutes)
- any time I can carve out by staying up late or getting up early (about 30 minutes)
But considering how much other stuff I do on a daily basis, two hours of coding is pretty crazy.
Anyway, I’m working on a parsing/crawling tool for SEO applications.* Here’s the backstory: My boss and I were working on an SEO project in which we’d need to provide page copy recommendations for a client’s website. There were close to 100 pages on the website that would all have to be copied into a Word doc (client’s choice) so we could make recommendations using the Track Changes feature.
The problem was that each page was formatted differently, with multiple DIVs containing text that was interspersed with images, forms, videos, etc. Copying and pasting straight from the site made for a super messy document and, quite frankly, it wasn’t something you’d want to shoot over to a client.
So, I thought, “If there were a way to just grab everything that was wrapped in a &lt;p&gt; tag within the main content area, we could use that text instead of copying straight from the webpage.” What I was essentially looking for was a way to take a webpage, strip out all the formatting/styling, images, and videos, and keep only what I needed: the raw text of the page. SEO tools that simulate a web crawler’s experience were still too sloppy for my purposes; they returned image alt tags, JS snippets, and other tidbits as well.
After realizing that the tool I wanted didn’t exist (as far as I could find) and that there were no resources to have our in-house developer build something out, I decided to take it on myself. It had been a while since I’d done anything this complicated (actually, I’d never done anything on quite this level). The most complicated thing I’d done in the last year was building some functionality to change a PPC phone number using PHP in a site’s global header.
So, about two weeks later, I’m about done with the tool. It essentially works like this:
- You pass in the page URL for the content you want to strip.
- Optionally, specify whether you only want the content from a specific area (a div ID or class).
- Optionally, specify which elements you want pulled: paragraphs, headings, images, or (the extremely greedy) “any text”.
- After you submit, it runs a PHP function that runs a query using a DOM (Document Object Model) library and pulls out everything you specified.
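The stripping step above can be sketched roughly like this. This is a minimal illustration using PHP’s built-in DOM extension (DOMDocument/DOMXPath); the actual tool may use a different DOM library, and the function and parameter names here are hypothetical.

```php
<?php
// Minimal sketch of the stripping step, assuming PHP's built-in DOM
// extension. stripContent() and its parameters are illustrative names.
function stripContent(string $html, string $divClass, array $tags): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);           // suppress warnings on messy real-world HTML
    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($tags as $tag) {
        // e.g. //div[@class='article']//p pulls every <p> inside the article div
        foreach ($xpath->query("//div[@class='$divClass']//$tag") as $node) {
            $results[] = trim($node->textContent);
        }
    }
    return $results;
}

// Usage: fetch the page first, then strip it.
// $text = stripContent(file_get_contents($url), 'article', ['p', 'h3']);
```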
That’s a pretty simplified breakdown of it, but you get the picture. My biggest challenge was allowing the tool to dynamically build the query. Here’s an example of the query that’s used: div[class=article] p, div[class=article] h3. This would pull back every paragraph and every H3 heading. The part that was tricky for me was piecing together this query based on the form input. What I ended up doing was pulling all the options from the form and assigning those to PHP variables. I referred to this section, div[class=article], as $level1 since it pretty much defines the scope and is used repeatedly in the query building. Then I needed to develop a way to build a query by using $level1, adding a specific element (like p tags), spitting out $level1 again, adding another element, and so on, for as many elements were selected in the form. I ended up using some if statements to say, if a given element is checked in the form, add that element to an array that keeps track of everything that needs to be pulled. So, if you checked h1, h2, and p, the array would contain H1, H2, P. Then, I used a foreach loop to cycle through the array and for each value, add to $level1 and the element to a $whatToFind variable. $whatToFind became my new, dynamic query that I used in my DOM function.
That may have been super confusing. Anyway, I want to launch the tool soon, but I hesitate to put it out there until it has a better UI. Stay tuned for updates on the tool, and feel free to hit me up with name suggestions or ideas for the UI.
* I’m very torn up about whether to call it a “parser” or a “crawler”. I have a feeling my tool doesn’t fit the definition of “parser”, so I hesitate to stick with that name. However, I don’t want to call it a “crawler” either, because a) the term “crawler” implies that the tool crawls pages in the same manner as a search engine crawler, which it does not, and b) I want a more unique and descriptive name. “HTML Page Stripper”? What’s the search volume on that?