Wednesday, August 27, 2008

To Scrape or Not To Scrape?


We've been toying around with the idea of screen scraping for some time now. The main reason we held off wasn't to avoid entering a debate about whether it's fair or not, but rather that we weren't sure we even wanted to deal with hosting any content of our own.

When I say hosting content, I'm referring to large amounts of content (videos, photos, documents, etc.). Not simple things like user information and a few photos. I'm talking YouTube. It reminds me a bit of the discussion that took place at my last startup. To hardware or not to hardware. Hardware is a beast I would avoid at all costs - unless, of course, that's your thing. We weren't a hardware company. Yet we forged ahead designing some hardware for our product (electronic ticketing). I didn't know that this was a problem at the time, nor was there much I could have done about it anyway. But an MBA and some perspective later... hardware was one of our downfalls. It was a money pit. The design, implementation, and rollout were all overly expensive, especially when you have a company like IDEO design things. And we built this big, bulky system. I mean, there were probably a few smart ways we could have made hardware work to our advantage. But we weren't hardware people - so we should have left it to the people who are.

Back to Vyoo. We were playing around with hosting our own reviews and recommendations, photos, videos, etc. But then we realized that it wasn't our core competency. Why reinvent the wheel? So we decided to link out to other sites or to try to partner with sites in order to allow users to find this preexisting content. Well, it's hard to forge relationships before you have a finished product, and we wanted to prove out our concept. So we decided to scrape some screens to pull some data into our site. We only provide snippets and we link out to the original sites. But it makes for a friendlier experience for our users.

This decision made me think about the implications of screen scraping. Initially, I had been opposed to the idea. I felt like people were just duplicating sites. Where was the innovation, originality, or hard work? And in that sense, I still believe screen scraping is a poor choice for copycats. Bring something new to the table! Or use snippets of the data to provide enhanced value (Digg). But then how about user-generated content? First of all, that's where I think startups can provide quite a bit of value - for the users themselves. And after all, this is our information, isn't it? If I leave profile information on Facebook, well, that's my information. Facebook didn't go to Haas. It didn't take my picture for me. It only let me use its service. And it's getting something out of that by using that information (targeted ad revenue). If I put my clothes in a locker at the gym, those are still my clothes. Sure, the gym owns the locker, but they don't own my clothes. This is a much larger discussion - with lots of nuanced ways in which I would take a different approach to this question of ownership.

Anyway, that's where I stand. With the user's permission (if the content is user-generated), sites should be able to leverage a user's information in a unique way. If it's not user-generated, well, providing snippets and sending traffic to the original site is totally cool. Again, Digg comes to mind.

1 comment:

Anonymous said...

"When I say hosting content, I'm referring to large amounts of content (videos, photos, documents, etc)."


Be prepared for handling a large number of DMCA takedowns.