A Simple HTML / CSS Parser With Objective-C
One of the biggest challenges of building ShopLater, an app that gets you the latest prices for products you love, was figuring out how to parse the HTML from a given retailer’s product page to get the product’s price, image, and title. With Ruby, I would just use the amazing Nokogiri gem: put in a CSS selector, and it finds the content of the matching elements.
After trying out the awful HPPLE library, we knew that we needed a different solution. The problem with HPPLE is that it parses using XPath, and our queries ended up tied to the exact structure of the page. So if the retailer changed a random div on a page, our system would break. Parsing by CSS selectors is a lot more reliable, since the class name for a price is unlikely to change very often.
The idea for our parser came from a very simple premise. The HTML on a page is a string. And with strings, you can do things like use regular expressions and, more interestingly as we discovered, the NSScanner class. Here is our simple CSS parser:
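+ (NSString *)scanString:(NSString *)string startTag:(NSString *)startTag endTag:(NSString *)endTag {
    // Nothing to scan, or nothing to scan for.
    if (string.length == 0 || startTag.length == 0 || endTag.length == 0) {
        return nil;
    }

    NSScanner *scanner = [[NSScanner alloc] initWithString:string];

    // Move to the start tag, then step over it. scanString:intoString: returns NO
    // when the start tag is not in the string, so scanLocation is never pushed
    // past the end (which would raise an exception).
    [scanner scanUpToString:startTag intoString:nil];
    if (![scanner scanString:startTag intoString:nil]) {
        return nil;
    }

    // Collect everything up to the end tag, or up to the end of the string
    // if the end tag is missing.
    NSString *result = nil;
    [scanner scanUpToString:endTag intoString:&result];
    return result ?: @"";
}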
To use the above method, pass in your HTML string and the tags between which your target is located, and it will return the string between your start and end tags. For example, suppose the product’s price appears on the page like this, as it does on Macy’s website:
<span class="priceSale">Now $79.99</span>
You pass in the HTML string for the Macy’s product page along with the start and end tags:
[Parser scanString:macysHTMLString startTag:@"<span class=\"priceSale\">" endTag:@"</span>"];
You’ll get back @"Now $79.99" as the result string.
For more complicated HTML, where the price might be in a random span with no identifier, you pass in an enclosing pair of start and end tags that is unique on the page, get back a new string that still contains the extra tags, then keep passing the result string into the scanner with more specific tags to narrow down to the item you need.
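The narrowing approach can be sketched roughly as follows. This is only an illustration under some assumptions: ScanBetween is a plain C function that mirrors the scanString:startTag:endTag: method above so the sketch compiles on its own, and ScanNested, along with the priceInfo markup used with it, is a hypothetical helper and not part of the ShopLater code.

```objc
#import <Foundation/Foundation.h>

// Mirrors the scanString:startTag:endTag: method above (assumption: written
// as a C function only so this sketch is self-contained).
static NSString *ScanBetween(NSString *string, NSString *startTag, NSString *endTag) {
    if (string.length == 0) {
        return nil;
    }
    NSScanner *scanner = [NSScanner scannerWithString:string];
    [scanner scanUpToString:startTag intoString:nil];
    if (![scanner scanString:startTag intoString:nil]) {
        return nil; // start tag not found
    }
    NSString *result = nil;
    [scanner scanUpToString:endTag intoString:&result];
    return result; // nil when there is nothing between the tags
}

// Hypothetical helper: applies each (startTag, endTag) pair in order,
// feeding the result of one scan into the next, outermost pair first.
static NSString *ScanNested(NSString *html, NSArray<NSArray<NSString *> *> *tagPairs) {
    NSString *current = html;
    for (NSArray<NSString *> *pair in tagPairs) {
        current = ScanBetween(current, pair[0], pair[1]);
        if (current == nil) {
            return nil; // a start tag was missing, or the span was empty
        }
    }
    return current;
}
```

For a page containing <div id="priceInfo"><span>Now $79.99</span></div>, passing the pairs (<div id="priceInfo">, </div>) and then (<span>, </span>) first isolates the contents of the div, then pulls the price out of the anonymous span inside it. Returning nil as soon as one pass fails makes it easy to tell which retailer’s markup has changed.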