Tuesday, January 25, 2005
Google video search - for closed caption info
Google Video is now available. It lets you search the closed captioning (aka subtitles) text of TV broadcasts. (As such, it seems poorly named - Google TV Search would be more appropriate. Other video search services, such as Yahoo Video, do indeed search for video clips on the web.)
Currently the beta service carries results from PBS, Fox News, CSPAN and CSPAN2, plus 4 local San Francisco stations. The database is very small - they have only been capturing since late December 2004, which seems to have given them just around 5500 shows. I arrived at this number by adding the number of results for a search on "news" to the number for "-news". Every show either contains the word or does not, so the two counts together should cover the whole index, and you get similar totals for other words, which seems to validate the approach. Searching for extremely common words, such as the stop words "the" or "a", gives an error page, though "and" gives 4500 results. Unfortunately "www" is also a stop word, so it's tricky to see which URLs have been broadcast recently in the captions.
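The counting trick above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are made up for this example, and the per-word hit counts are hypothetical placeholders, not figures read off the service.

```python
def estimate_index_size(hits_with_term: int, hits_without_term: int) -> int:
    """Every show either contains the term or does not, so the two
    result counts should add up to the size of the whole index."""
    return hits_with_term + hits_without_term


def estimates_agree(estimates, tolerance=0.05):
    """Return True when every estimate lies within `tolerance`
    (as a fraction) of the mean of all estimates."""
    mean = sum(estimates) / len(estimates)
    return all(abs(e - mean) <= tolerance * mean for e in estimates)


# Hypothetical (term, "-term") result counts, for illustration only.
samples = {
    "news": (3100, 2400),
    "weather": (900, 4620),
}

estimates = [estimate_index_size(w, wo) for w, wo in samples.values()]
print(estimates, estimates_agree(estimates))
```

If the totals for several unrelated words land close together, that suggests the reported result counts are consistent enough to trust the estimate.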
The capture of the shows is no doubt automatic, and the text is sometimes garbled. Shows seem to be added to the index in near real time, within about an hour of the show ending. Whilst it's not possible to view actual video of a show, single-frame thumbnail captures are shown with the results. The captions include the "♫" symbol, which is used to represent music or singing in the show, but unfortunately you can't search for that symbol - in fact even regular symbols such as "@", "#", "$" or "&" give the same error page.
Interestingly, there does not seem to be a 32-word limit on the query string - I ran a query with 40+ words OR'd together with the | operator, and got no complaint from the system. The results (separately) included the last word in the sequence as well as the first, so the query does not appear to have been truncated.
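For anyone wanting to repeat the long-query test, here is a minimal sketch of how such a query string could be assembled. The words are placeholders, and the spacing around the | operator is an assumption rather than documented syntax.

```python
def build_or_query(words):
    """Join the words with the | operator so that a match on any
    single word is enough to return a result."""
    return " | ".join(words)


# 40 placeholder words, i.e. more than the 32-word limit being tested.
words = [f"word{i}" for i in range(1, 41)]
query = build_or_query(words)

print(len(words), "terms")
print(query)
```

To check for truncation, look for results that match only the first word and results that match only the last word.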