How we parsed data from 1000s of restaurant menus
Restaurant menus online are a mess. They are embedded in Adobe Flash websites, uploaded as Word documents, or buried somewhere on a website in an unreadable format. We’ve fixed that.
In the last month we’ve indexed and parsed over 1000 menus into a structured, fully searchable format. This article describes how we did it, in the following steps:
- Finding and scraping menus
- Parsing the menu content
- Turning menus into MenuMarkup
Scraping the Internet for menus
Thanks to YelloYello we have access to a large database of local businesses that is constantly growing and being updated. Our web spider takes all restaurant listings in this database via the YelloYello API and:
- Finds the website of the business
Most information is readily available from YelloYello. If we don’t find the website we do a ‘best guess’ using famous search engines 😊.
- Searches the website for relevant content
This might be HTML pages containing menu data, PDF, Word or image documents, etc.
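A minimal sketch of how the "search the website for relevant content" step could work, using only the Python standard library. It is an illustration, not the actual spider: the keyword list, the extension list and the names `LinkCollector` and `find_menu_candidates` are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical hints; a real spider would need a longer, per-language list.
MENU_KEYWORDS = ("menu", "kaart")
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".jpg", ".jpeg", ".png")


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_menu_candidates(html, base_url):
    """Return (absolute URL, kind) for every link that looks like menu content."""
    collector = LinkCollector()
    collector.feed(html)
    candidates = []
    for href, text in collector.links:
        url = urljoin(base_url, href)
        path = urlparse(url).path.lower()
        looks_like_menu = any(k in path or k in text.lower() for k in MENU_KEYWORDS)
        if not looks_like_menu:
            continue
        # Documents and images go to text extraction or OCR; the rest is HTML.
        kind = "document" if path.endswith(DOCUMENT_EXTENSIONS) else "html"
        candidates.append((url, kind))
    return candidates
```

Calling `find_menu_candidates(page_html, "http://example.com/")` on a downloaded homepage returns the links worth fetching next, each tagged as `"document"` or `"html"` so the parsing step knows how to treat it.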
Parsing menu content
Next we take this content and turn it into plain text (or sometimes very basic HTML). This is done by grabbing the text out of PDF or Word files. If that is not possible, we turn them into images and do some OCR (the open-source Tesseract OCR engine does a great job).
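The "extract text first, fall back to OCR" idea can be sketched as below. This is an assumption-laden illustration rather than the real pipeline: it assumes the `pdftotext` and `pdftoppm` tools from poppler and the `tesseract` command-line program are installed, it leaves Word files out, and the function names are made up for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def run(cmd):
    """Run a command-line tool and return what it printed to stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def pdf_to_text(path):
    # Assumes poppler's `pdftotext` is installed; "-" writes the text to stdout.
    return run(["pdftotext", "-layout", str(path), "-"])


def ocr_image(path):
    # Assumes the `tesseract` CLI is installed; "stdout" prints the recognised text.
    return run(["tesseract", str(path), "stdout"])


def ocr_pdf(path):
    # Render each page to a PNG with poppler's `pdftoppm`, then OCR page by page.
    with tempfile.TemporaryDirectory() as tmp:
        run(["pdftoppm", "-png", "-r", "300", str(path), str(Path(tmp) / "page")])
        pages = sorted(Path(tmp).glob("page*.png"))
        return "\n".join(ocr_image(page) for page in pages)


def extract_text(path, direct=None, ocr=None):
    """Try direct text extraction first; fall back to OCR when it yields nothing."""
    suffix = Path(path).suffix.lower()
    if direct is None:
        direct = {".pdf": pdf_to_text}
    if ocr is None:
        ocr = {".pdf": ocr_pdf, ".png": ocr_image, ".jpg": ocr_image}
    if suffix in direct:
        text = direct[suffix](path)
        if text.strip():
            return text
    if suffix in ocr:
        return ocr[suffix](path)
    raise ValueError(f"no extractor for {suffix!r} files")
```

The two lookup tables are parameters so that other formats (or a different OCR engine) can be plugged in without touching the fallback logic. A scanned PDF with no embedded text comes back empty from `pdftotext` and therefore ends up in the OCR branch.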
Turning the content into MenuMarkup
To structure this text into different elements of a menu (or regular pricelist) we developed a very simple but effective MenuMarkup (see Wikipedia if you’re unfamiliar with the term ‘markup’).
Once we have the plain text, we use some hand-crafted code to turn it into MenuMarkup. Sometimes, however, the outcome (the menu) is not up to our standards when we review it, and we do some manual processing to fix the details.
And boom! There you have it. Now we can do some awesome searches like:
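To give an idea of what structuring plain menu text involves, here is a small sketch. It does not produce MenuMarkup (its syntax is not shown in this article); instead it builds a generic list of sections with items and prices. The rule it uses is an assumption: a line that ends in a price is a menu item, and any other non-empty line starts a new section.

```python
import re
from decimal import Decimal

# A line such as "Tomato soup ..... € 4,50" or "Steak 18.00".
PRICE_LINE = re.compile(
    r"^(?P<name>.+?)[\s.]*(?:€\s*)?(?P<price>\d+[.,]\d{2})\s*$"
)


def parse_menu(text):
    """Split plain menu text into sections, each with a list of priced items."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = PRICE_LINE.match(line)
        if match:
            if current is None:
                # Items that appear before any heading go into an untitled section.
                current = {"title": None, "items": []}
                sections.append(current)
            price = Decimal(match.group("price").replace(",", "."))
            current["items"].append(
                {"name": match.group("name").strip(), "price": price}
            )
        else:
            # Anything without a trailing price is treated as a section heading.
            current = {"title": line, "items": []}
            sections.append(current)
    return sections
```

Real menus break such simple rules all the time (descriptions on a second line, several prices per dish, OCR noise), which is exactly where the manual review comes in.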
- what is the average price of a steak in Amsterdam?
- where can we buy the cheapest pizza?
- where can I drink champagne and eat caviar?
- where can I eat a vegetarian schnitzel? (hmm..)
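Once menus are structured, questions like these reduce to a few lines of code. The sketch below assumes each menu is a dictionary with `restaurant`, `city` and `items` keys; that layout and the function names are invented for this example and say nothing about how the real data is stored.

```python
from decimal import Decimal
from statistics import mean


def matching_items(menus, dish, city=None):
    """Yield (restaurant, item) pairs whose item name contains the dish."""
    for menu in menus:
        if city is not None and menu["city"].lower() != city.lower():
            continue
        for item in menu["items"]:
            if dish.lower() in item["name"].lower():
                yield menu["restaurant"], item


def average_price(menus, dish, city=None):
    """Average price of a dish, or None if no menu lists it."""
    prices = [item["price"] for _, item in matching_items(menus, dish, city)]
    return mean(prices) if prices else None


def cheapest(menus, dish, city=None):
    """Return (restaurant, item name, price) for the cheapest match, or None."""
    matches = list(matching_items(menus, dish, city))
    if not matches:
        return None
    restaurant, item = min(matches, key=lambda pair: pair[1]["price"])
    return restaurant, item["name"], item["price"]
```

With this, `average_price(menus, "steak", "Amsterdam")` answers the first question and `cheapest(menus, "pizza")` the second. Plain substring matching is crude (it would count "steak sauce" as a steak), so a real search would need something smarter.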
Interested in this data? Contact us!