Monday, January 31, 2005

 

Tool for Thought - Steven Johnson writes in the NYT

Steven Johnson had an essay in yesterday's NYT entitled "Tool for Thought", in which he writes about his use of a collection of his own writings, together with selected passages from books he's read (entered by hand, with the help of a research assistant!), which together serve as "an archive of all my old ideas, and the ideas that have influenced me." By searching this archive he can certainly find old information he knew was there, but it also turns up information he didn't know he was looking for!

The software he uses to access this archive is DEVONthink (a Mac-only product), which makes connections between articles based on their content, so it can suggest other articles even when they share no directly matching keywords. As a writer of both articles and books, Steve finds this chaining together of archive fragments a great help in coming up with ideas, surfacing connections that would not otherwise have occurred to him.
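DEVONthink's matching algorithm is proprietary, so I can only guess at how it works, but the basic idea of linking fragments by content rather than by exact keywords can be illustrated with a simple bag-of-words similarity measure. The Python sketch below is purely illustrative (the archive entries and function names are mine, and real tools use far more sophisticated statistics):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def related_fragments(fragments, target_text, top_n=5):
    """Rank an archive of short text fragments by similarity to a target fragment.
    `fragments` maps a fragment name to its text."""
    target = Counter(target_text.lower().split())
    vectors = {name: Counter(text.lower().split()) for name, text in fragments.items()}
    return sorted(vectors, key=lambda name: cosine(vectors[name], target), reverse=True)[:top_n]

# Hypothetical archive entries, just to show the shape of the call.
archive = {
    "notes/emergence.txt": "ant colonies show emergent behaviour from simple local rules",
    "notes/cities.txt": "cities grow through local decisions with no central planner",
    "notes/recipes.txt": "a recipe for slow cooked beef stew with red wine",
}
print(related_fragments(archive, "emergent order arises from local rules, not central planning"))
```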

Steve goes into a lot more technical detail on the process in his own blog.

Since the data was assembled by Steve in the first place, it is focused in his direction, and the results are very relevant.

Steve also makes the very significant point that the fragments and articles are generally quite small - on the order of 500 words or fewer - and this seems to be a very good size for these connections to emerge. On the web, many pages are around this size (or at least the useful information is, once all the navigational markup has been removed). But with the "conventional" desktop search that Google et al have released recently, we are suddenly searching potentially huge locally produced documents, and the search breaks down at this point. Even though the better desktop search tools can preview their results with the context around the matching word, I know of no tool that lets you ask for the best matching paragraphs within a document, rather than simply the paragraphs that happen to match within a document that is somehow ranked highly as a whole.
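To make that last wish concrete, here is a rough sketch of what paragraph-level retrieval within a single long document could look like: split the document into paragraph-sized chunks, score each chunk against the query on its own, and return the best chunks rather than the whole document. This is my own illustration in Python, not a description of any existing tool:

```python
import re

def best_paragraphs(document, query, top_n=3):
    """Split a long document into paragraphs and score each one independently,
    so a well-matching 500-word chunk can surface even when the document as a
    whole is a poor match."""
    query_terms = set(query.lower().split())
    paragraphs = [p for p in re.split(r"\n\s*\n", document) if p.strip()]

    def score(paragraph):
        words = [w.strip(".,;:!?\"'") for w in paragraph.lower().split()]
        hits = sum(1 for w in words if w in query_terms)
        return hits / (len(words) + 1)  # normalise so long paragraphs don't win by length alone

    return sorted(paragraphs, key=score, reverse=True)[:top_n]
```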

For Windows users, the comments on Steve's blog suggest that askSam (which describes itself as a free form database) may be worth investigating as an alternative to DEVONthink. Other programs that perhaps have some similarity include

Thursday, January 27, 2005

 

Squared Circle - Experimental Image Finder

Jim Bumgardner has produced an interesting tool which searches for images via colour. The current implementation is in Flash; via a colour wheel and an intensity slider, the user selects a colour, and matching images are displayed.

The images used in the demo are all acquired from the Squared Circle group at flickr, which makes the demo all the more impressive. By concentrating on a group of images that are all the same format (square!) and of a diverse but themed subject (a circle that fills the image), this search mechanism can really concentrate on the aspect that does differ widely across the group - their colour.
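Jim hasn't published the details of his matching, but a plausible back-of-the-envelope version of colour-based search is easy to sketch: reduce each image to its average colour, convert to hue/saturation/value, and rank by distance from the hue and intensity the user picked on the wheel. The Python below (using the PIL imaging library; the function names and scoring are my own guesses, not his code) does just that:

```python
import colorsys
from PIL import Image  # the PIL/Pillow imaging library

def average_hsv(path):
    """Shrink the image to a single pixel to get its average colour, then
    convert that colour to hue/saturation/value."""
    r, g, b = Image.open(path).convert("RGB").resize((1, 1)).getpixel((0, 0))
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)

def hue_distance(h1, h2):
    """Hue is circular, so 0.95 and 0.05 are neighbours."""
    d = abs(h1 - h2)
    return min(d, 1.0 - d)

def closest_images(paths, target_hue, target_value, n=20):
    """Rank images by how close their average colour is to the chosen hue and intensity."""
    scored = []
    for path in paths:
        h, s, v = average_hsv(path)
        scored.append((hue_distance(h, target_hue) + abs(v - target_value), path))
    return [path for _, path in sorted(scored)[:n]]
```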



Wednesday, January 26, 2005

 

Relaxation Ranking Passage Retrieval

Via Tim Bray, I discovered Steve Green's blog. Steve works at Sun, and is the Principal Investigator for the Advanced Search Technologies project in Sun Labs.

Sun Labs are working on Passage Search, using a technique called Relaxation Ranking Passage Retrieval, of which Steve provides an easy-to-follow explanation. Basically it works by looking for passages that perfectly match the query, and assigning a penalty whenever the match is less than perfect - for example when words are missing, misordered, or separated by intervening words. The advantage is that in the absence of exact matches, "close" matches are found with no further work refining the search query, and the penalty numbers provide a ranking from best to worst match.
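For anyone who would like to see the penalty idea in code, here is a toy Python version - my own simplification, certainly not Sun's implementation - which charges a cost for missing query words, out-of-order words, and intervening words, so that a perfect contiguous match scores zero:

```python
def passage_penalty(query_terms, passage_terms,
                    missing_cost=10.0, order_cost=2.0, gap_cost=1.0):
    """Start from a perfect match (penalty 0) and add a cost for every way the
    passage falls short: a missing query word, words out of order, or extra
    words separating the matches. Lower penalty = better passage."""
    penalty = 0.0
    positions = []
    for term in query_terms:
        if term in passage_terms:
            positions.append(passage_terms.index(term))
        else:
            penalty += missing_cost                         # query word absent
    for prev, cur in zip(positions, positions[1:]):
        if cur < prev:
            penalty += order_cost                           # words misordered
        else:
            penalty += gap_cost * max(0, cur - prev - 1)    # intervening words
    return penalty

query = "passage retrieval".split()
print(passage_penalty(query, "relaxation ranking passage retrieval".split()))  # 0.0 - exact match
print(passage_penalty(query, "passage level retrieval".split()))               # 1.0 - one intervening word
```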

Tuesday, January 25, 2005

 

Google video search - for closed caption info

Google Video is now available; it allows searching of the closed captioning (aka subtitles) text of TV broadcasts. (As such, it seems poorly named - Google TV Search would be more appropriate. Other video search services, such as Yahoo video, do indeed search for video clips on the web.)

Currently the beta service carries results from PBS, Fox News, CSPAN and CSPAN2, plus 4 local San Francisco stations. The database is very small - they have only been running the capturing since late December 2004, which seems to have given them just around 5500 shows. I arrived at this number by adding the number of results for a search for "news" to the number for "-news" - you get similar totals for other words, which seems to validate the approach. Searching for extremely common words, such as the stop words "the" or "a", gives an error page, though "and" gives 4500 results. Unfortunately "www" is also a stop word, so it's tricky to see what URLs have been broadcast recently in the captions.
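The estimate itself is nothing more than the observation that a word and its exclusion partition the index between them, so their result counts should sum to roughly the total. In code (the individual counts below are invented for illustration; only the ~5500 total reflects what I actually saw):

```python
# Result counts for a word and for its exclusion; the split below is invented,
# only the ~5500 total reflects my actual searches.
hits = {"news": 3100, "-news": 2400}

def estimated_index_size(counts):
    """A word and its negation partition the index between them, so the two
    result counts should sum to roughly the total number of indexed shows."""
    return sum(counts.values())

print(estimated_index_size(hits))  # ~5500
```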

The capture of the shows is no doubt automatic, and seems to suffer from garbled text. Shows seem to be added to the index in near real time, within about an hour of the show ending. Whilst it's not possible to view actual videos of the shows, there are single-frame thumbnail captures shown with the results. The capturing includes the "♫" symbol, which is used to represent music or singing in the show, but unfortunately you can't search for that symbol - in fact even regular symbols such as "@", "#", "$" or "&" give the same error page.

Interestingly, there does not seem to be a limit of 32 words in the query string - I searched for a query with 40+ words OR'd together with the | operator, got no complaint from the system, and received results that (separately) included both the first and the last word in the sequence.

Monday, January 24, 2005

 

New features of AOL search

AOL have launched their new search, which takes the now tried and tested route of offering Google search results with a layer of additions on top.

Amongst the layered features are:

 

Slashdot discussion on new Google query limits

Slashdot picked up on the news that Google has raised the permitted number of words in a query from 10 to 32 words.

Because it only appeared in the dedicated Google category rather than on the Slashdot front page, the news was greeted by a mere 50 comments (compared to the order-of-magnitude greater number a front-page article can expect), but that also means those comments are generally of a higher quality.

Amongst the interesting posts are comments which note:
Other comments I've seen around on blogs etc include
It's also worth noting that the word OR (or the | symbol, which is equivalent), which as you might expect performs a boolean OR query rather than the default AND query, does not count towards this limit.
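If you want to check whether a query will fit, the counting rule as described is trivial to encode - though this is just my reading of the observed behaviour, and Google's real parser may well count differently:

```python
def counts_towards_limit(query, limit=32):
    """Count the query terms the way described above: OR and its | synonym
    are free, everything else counts against the limit."""
    terms = [t for t in query.split() if t not in ("OR", "|")]
    return len(terms), len(terms) <= limit

print(counts_towards_limit("cat | dog | ferret OR stoat"))  # (4, True)
```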

 

Ben Goodger, Firefox lead, now works for Google

My last post was about checking the blogs of Ben Goodger and Blake Ross, who together have made huge contributions to the success of Firefox. I noted that there wasn't much of interest on either blog at the moment.

Today that changed - though when I read the news in my RSS reader, the headline of "Changes" scarcely foretold the major news contained in the post. The news is that, as of two weeks ago, Ben Goodger is employed by Google rather than the Mozilla Foundation. In practice his job remains as it was, but with Google paying Firefox's lead developer, we can be sure that new and interesting overlaps between the Mozilla Foundation's browsers and Google's services will develop.

Sunday, January 23, 2005

 

I guess Blake and Ben have been rather busy recently

It often pays to go direct to the source, so I checked out the blogs of both Blake Ross, Firefox creator (and Wired cover model), and Ben Goodger, Firefox lead developer.

Neither is currently particularly enlightening - Blake's blog is just 2 days old, and has 2 entries in total, whereas Ben's blog carries a 1997-2004 copyright, and has been verified to work in Mozilla 1.2!

Oh well, I guess they've been too wrapped up in shipping Firefox 1.0 to bother with blogging these past few months. Whilst I applaud the achievements of Firefox, I generally stick with the more powerful browser that forms part of the Mozilla suite.

Saturday, January 22, 2005

 

Google search limit raised to 32 words

Google appears to have silently raised the number of words it permits in its search query from 10 to 32 words.

This is great news - whilst the casual user won't notice the difference, the power searcher was often hitting this limit. Now it's possible to do much more targeted searches, making far heavier use of the - operator to exclude words you know are of no interest. There's a whole bunch of other applications that benefit from raising the limit, and plenty of partial workarounds that are now redundant - a Google search shows many of them.

The support for the new limit is somewhat patchy:
The only other mention I've seen of this new limit is at ResearchBuzz - there is no mention of it yet on the Google site.

 

Picasa2 - No pictures found

I've held off talking about Picasa2 for a few days, in order to give me time to investigate it fairly thoroughly.

First off I should say that when it works, I'm quite impressed with it. A lot of work has been put into the look and feel of the program, and its underlying features and capabilities are a huge step up from the previous version.

Things I particularly like about it include:

The showstopping bugs

However, you may have seen the qualification I made of "when it works". Unlike many recent tool releases from Google, or indeed many of the other search vendors, Picasa2 does not carry the "beta" designation, but is supposedly fully finished. However, my experience has been that this is far from the case, and I have had considerable problems running the program.

On my machine I have thousands of images - mainly a combination of digital camera images, scanned images, and web sourced images. The first two of these categories mean I have large groups of very similar format images, but the latter group means I have images that have been produced in many different ways, by different producers using different software packages.


No pictures found when scanning has completed

When Picasa scans my disk to find these images, it spends hours processing, at the end of which it reports "No pictures found"! Whilst it's doing this scanning I can of course still use the program, and it does indeed show me many images it has found, but if I leave it long enough to complete the scan, it somehow forgets all the images it has already discovered. My guess is that somewhere in the web-sourced images is an image it does not understand, and that this manages to corrupt its image database in such a way that it can no longer use it to access any images. (Database corruption seems to be an issue of which the Picasa developers are well aware, since they provide several answers about it in the Picasa knowledge base.) This was not a one-off occurrence - I ran the program until it completed its scan a number of times, clearing the database between each run, and the result was always the same: a report of no pictures found.

The only way I have been able to continue to use the program is to change from a "scan all my disk" way of working, to one that only runs the program on selected directories - but that's not what I need this program for. The digital camera images are generally already fairly well organized, whereas the miscellaneous other images are not.

It's not simply the number of files I have - the reviewer's guide claims testing on 250,000 photos, so my 10,000+ images are a mere stroll in the park in comparison.


Program hangs on RAW files

One of the improvements in this release is the support of RAW files. Unfortunately if I turn on this option, the program simply hangs - taking 100% CPU - when it encounters my RAW images (Minolta format). Whilst the Minolta RAW format is perhaps less well reverse engineered than some of the other RAW formats, other programs do cope, and I would have expected testing at Google to have included images from all currently available cameras. Even if this were not possible, the program should have been coded in such a way that it could detect that it has got stuck on a particular image, perhaps popping up a dialog box after 20 seconds spent on an image (time adjusted for image size and CPU power) to say "I seem to be having problems decoding this image - would you like me to continue?" (or offering a preference to automatically give up on such images).
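That kind of watchdog is not hard to build. Here is a minimal Python sketch of the idea (my own illustration, and obviously nothing to do with Picasa's actual code): hand the decode to a worker process and give up after a timeout, rather than letting one bad file hang the whole scan.

```python
import multiprocessing
from PIL import Image

def decode(path):
    """Fully decode the image so a malformed file can't stall us later."""
    with Image.open(path) as im:
        im.load()
    return True

def safe_decode(path, timeout=20):
    """Hand the decode to a worker process and give up after `timeout` seconds,
    so one bad file can't hang the whole scan."""
    with multiprocessing.Pool(processes=1) as pool:
        result = pool.apply_async(decode, (path,))
        try:
            return result.get(timeout)
        except multiprocessing.TimeoutError:
            return False   # decoder is stuck - skip this file and move on
        except Exception:
            return False   # unreadable or corrupt file
```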

To make matters worse, when I killed off and restarted Picasa, it put up a dialog to say that it had had a problem with the raw image, but when I went to the disk to confirm that it was a valid image that other programs could handle, I could not find the image. I thought initially that Picasa had deleted the image, but fortunately that was not the case - all it had apparently done was to give the image the hidden attribute, presumably so that a subsequent scan would ignore it. However that means that none of the other programs I use can see it either, which is unacceptable.

Other bugs

The feature where rough thumbnails are displayed and then, as time permits, replaced with better ones does not account for the image being rotated between the time the rough thumbnail was shown and the time the finer one was created. Many times I ended up with the image shown in one orientation, but the drop shadow around it in the other, rotated orientation.

When using the program whilst it is still scanning for images, the mouse cursor flashes most distractingly. It appears as if the scanning routine is trying to set the cursor shape to the hourglass, but the main UI is resetting it.

You can't move the scanning status window off the main monitor on a multiple monitor system - it responds to being moved vertically, but not to being moved horizontally.

One of the program's big advantages is that it does not change your images on disk - it merely records what transformations to apply to the image between reading it off disk and displaying it on the screen. However this regard for the integrity of the original images does not extend to all exported images. I processed a lot of images which I "exported as web page", electing to use the original size images. All the unrotated images were exported as their original files as expected, but all the images whose only transformation was a rotation by 90 degrees were horribly degraded - it turns out that instead of applying the rotation transformation to the JPEG file directly (a lossless operation offered by the IJG software which Picasa uses), the images are being decompressed, rotated, and then recompressed at the current quality settings. Again, unacceptable behaviour. I regard this as a bug since the unrotated images are not being recompressed, but the rotated ones are - if all images were being recompressed then it might just be justifiable as a feature (though a poor one at that!). Note that the "Export to folder" option does not suffer from the same problem - so Picasa does know how to do lossless rotation, but is not applying it in all the places it should.
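The difference between the two approaches is easy to demonstrate with the IJG tools themselves. The sketch below (Python driving the jpegtran command line; the function names are mine) shows the lossless rotation that should be happening, alongside the decompress-rotate-recompress route that appears to produce the degraded exports:

```python
import subprocess
from PIL import Image

def rotate_lossless(src, dst, degrees=90):
    """Rotate a JPEG without decoding it, using the IJG jpegtran tool; no pixels
    are re-encoded, so there is no generation loss. -copy all keeps the EXIF data."""
    subprocess.run(["jpegtran", "-rotate", str(degrees), "-copy", "all",
                    "-outfile", dst, src], check=True)

def rotate_lossy(src, dst, quality=85):
    """The route the exported web-page images appear to take: decode the pixels,
    rotate them, then recompress - losing quality in the process."""
    Image.open(src).rotate(-90, expand=True).save(dst, quality=quality)
```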

As a further point to note, when Picasa writes JPG files out to disk (generally as a result of some export feature), it is not writing them in optimized form. This is merely a case of not setting the correct parameters when calling the IJG library that the program uses. Writing the files in optimized form saves up to 10% of disk space, especially on thumbnail images.

A trivial point, but one that maybe indicates that testing was not so thorough: the help page for the Gift CD feature is in a different font from the rest of the help. Perhaps this was a feature that was slipped in late in the development cycle.

Features that could be much improved

Marking a number of digital camera images and then pressing the rotate button takes an incredibly long time. It looks as if the program may be returning to the original images on disk and producing new thumbnails, whereas it could and should simply rotate the existing thumbnails - an operation that is orders of magnitude faster and, since rotation of JPEGs is lossless, just as accurate.

When browsing to select a folder (say to export to), this must always be done by navigating through a graphical directory tree. If only the dialog allowed for an entry box into which I could type the name of the folder to use, then this would be much quicker.

The histogram overlays the image, using transparency. Why not simply move the histogram to the unused space on the left of the screen, under the effects chooser?

Similarly for the 1:1 zoom feature - why does this overlay the image, when there is plenty of space for it on the left of the screen? In its current arrangement, it is impossible to zoom in on the bottom right of an image, since that area is always hidden behind the zoom window itself.

When viewing the properties of an image, the popup window that displays them is not resizable, and is far too small to show the necessary information. In addition, it does not seem to be possible to select the information, so that it can be pasted into another application. Just as with the histogram and zoom windows, this information would be much better if it were displayed continuously, using that large blank space on the left of the screen.

I cannot find how to select multiple thumbnails at once, when they are in different folders. Pressing control when selecting an image adds it to the images already in the tray, as long as you are in the same folder as all the existing images, but as soon as you select an image in another folder the tray contents are reset without warning. You can choose to hold images that are in the tray by selecting them in the tray and marking them as hold, but I need a mode where I can quickly select many images from different folders to work on.

In a related area, I ran a search which found a load of images. I then want to do some operation on all the found images - perhaps to add a label to them, or to export them. I can't see how to do this - all commands seem to operate either on a single folder, or on the contents of the picture tray.

The defocusing of the UI when there is a choice to be made (such as when abandoning an effect to switch to the tuning tab) gives me a headache. There is a perfectly standard method of greying out controls when they are not available, which is a much better effect than throwing all of them out of focus - but even that is unnecessary in this case: the popup dialog is modal, so I am already prevented from using the controls, without needing to have them visually distinguished. Photographers spend a lot of effort ensuring pin-sharp focus, so it's a spectacularly poor choice of UI design to make them stare at deliberately out-of-focus controls.

I run with multiple monitors, which gives me a very wide display and is incredibly productive in many situations. In particular it is ideal for photo work, where one screen can show a page of thumbnails while the other shows the particular image I am working on. Unfortunately Picasa does not work in this mode - it only has a single window, which shows either thumbnails or a single image - which is far less productive. Please allow me to have the thumbnail view and the detail view in different windows, and better still allow me to have multiple detail view windows open at once, so I can do side-by-side comparisons of images.



Thursday, January 20, 2005

 

IE v Firefox review

A comparative review of IE vs Firefox that takes a slightly different approach from most such reviews...

 

nofollow tag

Google's definition of a nofollow tag for links has got a lot of coverage, and, as noted by Scoble, even competitors signed up to it, without involving a standards committee.

My take on the tag is that it's unlikely to decrease comment spam - in fact it will probably lead to an increase, as spammers blast more sites in an effort to find those that still have an effect on the search engines' rankings. What it will do is reduce the percentage of comment spam that gets to influence search results - but at the expense of also reducing the influence of legitimate comment-based links.

I hope that the various citation services, that offer ways to find out who is linking to a given page, don't simply drop all nofollow links from their index. What they should do is simply use the flag to separate the links into two groups, so they can report "this page is linked to by 57 pages, and 10 others which specified nofollow".
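Producing that kind of split report is straightforward for anyone already crawling backlinks. Here is a minimal Python sketch - my own, not any particular citation service's code - that separates the links found on a page into plain and nofollow groups; a citation service would simply run this over every page linking to the target and add up the two totals:

```python
from html.parser import HTMLParser

class LinkClassifier(HTMLParser):
    """Collect a page's links, separating rel="nofollow" ones from the rest."""
    def __init__(self):
        super().__init__()
        self.plain, self.nofollow = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        rel = (attrs.get("rel") or "").lower().split()
        (self.nofollow if "nofollow" in rel else self.plain).append(href)

# Example with two made-up links, one plain and one nofollow.
parser = LinkClassifier()
parser.feed('<a href="http://a.example/">one</a> '
            '<a rel="nofollow" href="http://b.example/">two</a>')
print(f"{len(parser.plain)} plain links, {len(parser.nofollow)} nofollow links")
```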

Many are talking about using this as an editorial decision flag - "I'm linking to this page, but not necessarily for positive purposes". As such, the idea of using CSS to display such links in an alternative form is a good one.

There is a good round up of the reaction to this tag at InsideGoogle.

Sunday, January 16, 2005

 

Marc Orchant's Office Weblog

The Office Weblog covers, according to the about page, the "office software industry" - that's office with a small 'o', not necessarily Microsoft Office with a big 'O'. A quick review of the content shows this to be very much my definition of "smart apps and smart ideas", so there's a huge overlap with the interests I write about here.

(Marc Orchant, who writes the blog, also has another, older and now much less frequently updated blog titled "Marc's Outlook on Productivity. Cool tools and tips to get more done with Office!", which would appear to target big 'O' Office more directly.)

A few choice recent articles amongst the very prolific output are:

Thursday, January 13, 2005

 

Independent JPEG Group offers lossless JPEG optimization

My previous post talked about some new software which can compress JPEGs losslessly by up to 30% when adding them to an archive. This got me thinking about how this might be achieved, and the obvious place to experiment seemed to be the widely used Independent JPEG Group's jpeg library.

Looking through their offering, I noticed the jpegtran program, which is able to perform lossless transforms on JPEG files - such as rotation - without the decompression and recompression that most other image processing apps use to achieve this.

I also found that this program offers an -optimize parameter, which the help says performs "Optimize Huffman table (smaller file, but slow compression)". The relevant command line is "jpegtran -optimize -copy all in.jpg out.jpg". Running this on a number of files showed that even this simple operation makes a big difference to the file size:
I checked that the images were identical by loading both the before and after versions into an image processing application and saving them as 24bpp TIFFs. The TIFFs from the original and from the better-compressed image were byte-for-byte identical. The "-copy all" parameter ensured that all extra information, such as the EXIF info and comment info, was also retained by the operation.
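For anyone who wants to run the same experiment over a whole directory of images, here is a small Python wrapper around that command line (the output filenames and the percentage report are my own choices; jpegtran itself is doing all the real work):

```python
import pathlib
import subprocess

def optimize_jpeg(src):
    """Losslessly rebuild a JPEG's Huffman tables with jpegtran, keeping EXIF
    and comment markers via -copy all, and report the before/after sizes."""
    dst = src.with_name(src.stem + ".opt.jpg")
    subprocess.run(["jpegtran", "-optimize", "-copy", "all",
                    "-outfile", str(dst), str(src)], check=True)
    return src.stat().st_size, dst.stat().st_size

for jpg in sorted(pathlib.Path(".").glob("*.jpg")):
    if jpg.name.endswith(".opt.jpg"):
        continue  # don't re-optimize our own output
    before, after = optimize_jpeg(jpg)
    print(f"{jpg.name}: {before} -> {after} bytes "
          f"({100 * (before - after) / before:.1f}% smaller)")
```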

Note that I am still ending up with a perfectly valid jpg format file - loadable in any image processing package. This is not what Stuffit are doing - they are trying to introduce a new image format that is incompatible with existing image processing programs.

It's certainly been an eye opener for me - I'll be reprocessing all the images on my website to take advantage of the better compression this simple operation gives me.

 

Further lossless compression of JPEG images

The makers of Stuffit have software ready for incorporation in their next version that gets a further 30% or so of lossless compression out of JPEG image files. These claims have been independently checked by Jeff Gilchrist, who maintains an archive comparison page. His figure was 25% compression - which is at least close to the claimed figure. Maximum Compression also received a beta version of the software, and achieved 23% compression on their test image.

The whitepaper (in PDF) gives few clues as to what they are doing, but speculation is rife. What seems self-evident is that they have at least replaced the RLE and Huffman encoding stages, which are the lossless stages that complete the JPEG algorithm, with more efficient methods. This is not novel (and would appear to be unpatentable) - after all, the official JPEG algorithm always allowed for the use of arithmetic coding instead of Huffman, though in practice this is rarely used due to existing patent complications.

There are other well known inefficiencies in the use of fixed length codes for the beginning and end of blocks, which mean that at higher (lossy) compression ratios, the JPEG algorithm performs particularly poorly - with a 30% inefficiency easily explained. Of course, when the JPEG standard was developed, computers were not as powerful, so a bit of inefficiency in the encoding was a fair price to pay to simplify the amount of processing power needed.

 

Does Google Mini give a direction for Google Desktop Search?

Google Mini is the latest version of the Google Search Appliance, aimed at small businesses with up to 50,000 documents on their intranet webservers.

The appliance supports searching in 220 file formats, which is around the same number claimed by Yahoo for their desktop search. This is unsurprising - I'm sure that both these products are using the Outside In technology from Stellent.

Google Desktop Search has been criticised for its poor (and currently unexpandable) support for file formats - perhaps we will see it expanding its range of supported formats via this same software library.

Wednesday, January 12, 2005

 

Interesting posts on Yahoo Desktop Search forum

The Yahoo Desktop Search forum has gathered around 250 messages since the program was launched.

There's a lot of noise there (especially people asking, or in some cases demanding, support for Mac, Linux, Eudora, Thunderbird and Netscape mail), but just a few of the messages contain useful information:

Tuesday, January 11, 2005

 

Yahoo Desktop Search

The last of the big pre-announced desktop search tools has arrived, that from Yahoo. It's a customized, and slightly cut-down, version of X1. (Then again, it is free, whereas X1 is not.)

Highlights for me include
Although I like the preview pane, and it's quite fully featured, its implementation has a number of problems:
I also get the impression that the system is not able to detect files as they are created or changed, but relies on picking them up when it next runs an indexing pass - which can mean that the index is often out of date.

The system also writes a lot of files to the temporary directory (say from email attachments, or extracted from inside zip files), but does not clean them up. The next indexing pass then includes those files, so suddenly the files you are accessing regularly are appearing twice in your results - the real file, and the duplicate in the temp directory.

I found that the program seems to lock my Outlook folders - if I exit Outlook, and restart it, then I cannot access any of my emails, unless I also stop Yahoo search.

Overall, the decision to buy rather than write from scratch has given Yahoo an offering that is a strong contender in this market in terms of features and supported file types. There are, however, a number of bugs and some very rough edges, which are surprising given that, despite the beta tag on this customized version, this is a product that has been shipping for a while in non-customized form.

Wednesday, January 05, 2005

 

Dictionary as you type

ObjectGraph Dictionary is another implementation of "providing information as you type" - this time giving you dictionary entries.

Rather different from the other implementations we've seen recently, which offered either suggested words to search for or an index of pages to visit, in this demo the list which pops up provides the full result - there is no further page to consult.

In fact if you actually press enter at the end of your typing, then nothing happens. I think it would be useful for enter to take you to a static page of the results - that way you could do other things with the definitions you have just found, such as cutting and pasting them somewhere.
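The as-you-type behaviour itself is simple to prototype. Below is a tiny Python sketch - with an invented three-word dictionary standing in for the real data - that returns complete entries for every word matching the prefix typed so far, which is essentially what the demo's pop-up list is doing:

```python
import bisect

# A three-word stand-in for the real dictionary (word -> definition).
ENTRIES = {
    "seam": "a line where two pieces of material are joined",
    "search": "to look for something carefully",
    "season": "one of the four divisions of the year",
}
WORDS = sorted(ENTRIES)

def suggestions(prefix, limit=10):
    """Return complete entries for every word starting with the prefix typed so far."""
    matches = []
    start = bisect.bisect_left(WORDS, prefix)
    for word in WORDS[start:]:
        if not word.startswith(prefix) or len(matches) >= limit:
            break
        matches.append((word, ENTRIES[word]))
    return matches

# Called on each keystroke: suggestions("sea") returns seam, search and season in full.
print(suggestions("sea"))
```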