A Simple HTML / CSS Parser With Objective-C

One of the biggest challenges of building ShopLater, an app that gets you the latest prices for products you love, was figuring out how to parse the HTML from a given retailer’s product page to get the product’s price, image, and title. With Ruby, I would just use the amazing Nokogiri gem: give it a CSS selector, and it returns the content of the matching tags.

After trying out the awful HPPLE library, we knew that we needed a different solution. The problem with HPPLE is that it parses using XPath, and the queries we wrote were very specific to the page structure. So if the retailer changed a random div on a page, our system would break. Parsing by CSS selectors is a lot more reliable, since the class name for a price is unlikely to change very often.

The idea for our parser came from a very simple premise. The HTML on a page is a string. And with strings, you can do things like use regular expressions and, more interestingly as we discovered, the NSScanner class. Here is our simple CSS parser:

+ (NSString *)scanString:(NSString *)string
                startTag:(NSString *)startTag
                  endTag:(NSString *)endTag
{
    // Nothing to scan, or no tags to scan for.
    if (string.length == 0 || startTag.length == 0 || endTag.length == 0) {
        return @"";
    }

    NSScanner *scanner = [[NSScanner alloc] initWithString:string];

    // Advance to the start tag. If it is missing, the scanner stops at the
    // end of the string and the scanString: check below fails.
    [scanner scanUpToString:startTag intoString:nil];
    if (![scanner scanString:startTag intoString:nil]) {
        return @"";
    }

    // Collect everything up to the end tag (or the end of the string).
    NSString *result = nil;
    [scanner scanUpToString:endTag intoString:&result];

    return result ?: @"";
}

To use the method above, pass in your HTML string and the tags that surround your target, and it returns the string between the start and end tags. For example, suppose the product’s price appears on the page like this, as it does on Macy’s website:

<span class="priceSale">Now $79.99</span>

You simply pass in the HTML string for the Macy’s product page with the start and end tags:

[Parser scanString:macysHTMLString
          startTag:@"<span class=\"priceSale\">"
            endTag:@"</span>"];

You’ll get back @"Now $79.99" as the result string.

For more complicated HTML, where the price might be in a span with no identifier, pass in the outermost uniquely identifiable start and end tags first. You get back a string that still contains the inner tags. Keep passing that result back into the scanner with more specific tags until you have narrowed it down to the item you need.
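
Here is a minimal sketch of that two-pass narrowing. The markup, the `productPrice` class name, and the `TextBetween` and `SamplePrice` helpers are all made up for illustration; `TextBetween` is only a stand-in for the scanString method so the example can be compiled on its own.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for scanString:startTag:endTag:.
// Returns the text between startTag and endTag, or @"" if startTag is missing.
static NSString *TextBetween(NSString *html, NSString *startTag, NSString *endTag) {
    if (html.length == 0 || startTag.length == 0 || endTag.length == 0) {
        return @"";
    }
    NSScanner *scanner = [NSScanner scannerWithString:html];
    [scanner scanUpToString:startTag intoString:NULL];
    if (![scanner scanString:startTag intoString:NULL]) {
        return @"";
    }
    NSString *result = nil;
    [scanner scanUpToString:endTag intoString:&result];
    return result ?: @"";
}

// Made-up markup: the price span has no identifier, but its parent div does.
static NSString *SamplePrice(void) {
    NSString *html =
        @"<div class=\"header\"><span>Free shipping</span></div>"
        @"<div class=\"productPrice\"><span>Now $79.99</span></div>";

    // Pass 1: cut the page down to the uniquely identified container.
    NSString *container = TextBetween(html, @"<div class=\"productPrice\">", @"</div>");

    // Pass 2: scan the smaller string with the more specific tags.
    return TextBetween(container, @"<span>", @"</span>");
}
```

Scanning for a bare `<span>` on the whole page would have picked up the shipping banner first, which is why the first pass matters. One caveat with this approach: each pass stops at the first end tag it finds, so a container with nested divs would be cut short and would need a more distinctive end marker.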

  • Michael Kozono

    How do you plan to efficiently deal with inevitable website changes? Welcome to the world of scraping 🙂

    Also, I’m not familiar with hpple but xpath can be used for selecting anything in an XML doc; nokogiri converts css to xpath. In any case, I’d probably use mostly regular expressions for this stuff anyway.

    • I guess we weren’t using XPath correctly! Much happier with this short solution though 🙂 The HPPLE library had a few files in it, and on iOS, you need as little baggage as possible!

  • Fernando Rosentalski

    Nice approach, just what i needed

  • applefreakruben

    [Parser scanString:macysHTMLString
    startTag:(NSString *)@””
    endTag:(NSString *)@””];

    What is the “Parser” in here? I got an error.

    • Parser is just my class that has the scanString method.

      • Mikke

        and what does that class look like? 🙂

  • Luthan T. Hill

    Natasha,
    Wonderful blog. Thanks for all the great resources. Quick question. If you’re a beginning developer, is Objective-C too steep a learning curve to dive into? Would you suggest starting with something simpler?

    • Hi Luthan,

      Objective-C definitely has a steep learning curve, but it’s soooo much fun!!! I can’t believe I get paid to do this!

      The main issue with learning it as a beginner is a lack of beginner resources. I know when I was first starting out, after I did CS106A, I had a pretty hard time getting the hang of it, so I worked on Ruby on Rails.

      Ultimately, all programming languages have the same concepts, so it doesn’t matter as much what you learn first, as long as there are good resources available. Then you can more easily apply those concepts to new languages. Knowing Ruby on Rails has definitely helped me understand a lot about architecting my code in iOS.

      Not sure if this helps, but I guess learn whatever has the most resources for your learning style, and then move on to Objective-C.

  • Kimpeh

    Thanks. I finally solved my problem thanks to your solution. I have been struggling with HPPLE for a week.

  • prexxi

    How does extracting the value of an attribute work? E.g. the src of an img tag.

  • Indra

    Thanks for the solution, it really helped me