P O S T M O D E R N

Spidr "solid" 0.2.0

crawl, crawler, gemcutter, https, link, page, pause, skip, spider, spidr, ssl, web

After a period of refactoring, Spidr 0.2.0 (code-named "solid") has been released. Many things were added in this release, along with some very important bug-fixes and optimizations.

Major changes

Spidr, along with many of my other projects, has moved to YARD-based documentation. YARD's tag-based documentation format really helped me annotate every method within Spidr. The new YARD docs can be found in the usual location.

Also following suit with many of my other projects, you can now find Spidr on the awesome and easy-to-use gemcutter.org.

Spidr should be a little faster now. Thanks to the work of justfalter, HTTP sessions for unique host/port combinations are now cached and reused, so a session no longer has to be re-initialized for every request. Also, the history and failures lists are now Ruby Sets, yielding improved lookup times when checking whether a link has already been visited.
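
For example, after a crawl you can inspect the history and failures sets directly. A minimal sketch (the URL is just a placeholder):

    require 'spidr'

    agent = Spidr::Agent.new
    agent.start_at('http://www.example.com/')

    # history and failures are now Set objects, so membership checks are fast
    puts "visited #{agent.history.length} URLs"
    puts "failed to fetch #{agent.failures.length} URLs"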

The code-base of Spidr should be a little more organized. Many methods within Spidr::Agent were grouped by functionality and moved into separate modules (Events and Filters), which are included back into Spidr::Agent.

The Spidr::Actions module was also added, providing action methods that control spidering from within the every_url, every_page, every_failed_url, and all_headers event hooks. The pause! method will pause the spider, while skip_page! and skip_link! skip the processing of the current page or link.
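
Here is a rough sketch of how these actions might be used from within the event hooks (the URL and the /admin path check are purely hypothetical):

    require 'spidr'

    Spidr.site('http://www.example.com/') do |agent|
      agent.every_url do |url|
        # do not follow links into the admin area
        agent.skip_link! if url.path =~ %r{^/admin}
      end

      agent.every_page do |page|
        # stop processing pages that did not come back with a 200
        agent.skip_page! unless page.code == 200
      end
    end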

The Spidr::Page#search, Spidr::Page#at, and Spidr::Page#title methods were also added. These methods should make Spidr::Page feel a little more like WWW::Mechanize::Page, allowing one to search the DOM (parsed by Nokogiri) with XPath/CSS-path expressions.
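
For example, inside an every_page block (a small sketch; the XPath expressions are just examples):

    agent.every_page do |page|
      # the page title, pulled from the parsed DOM
      puts page.title

      # at returns the first node matching an XPath/CSS-path expression
      generator = page.at('//meta[@name="generator"]/@content')
      puts "  generator: #{generator}" if generator

      # search returns every matching node
      page.search('//img/@src').each do |src|
        puts "  image: #{src}"
      end
    end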

Lastly, many new examples highlighting some of the less-than-trivial things you can do with Spidr have been added to the website and docs.

Important bug-fixes

Spidr::Agent should now properly handle the spidering of SSL-protected websites. Also thanks to justfalter, HTTPS sessions are now properly initialized and stored in the HTTP session cache, so the SSL handshake only needs to be performed once per unique host/port.
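
Spidering an SSL-protected site should now be as simple as this (placeholder URL):

    require 'spidr'

    # the HTTPS session for www.example.com:443 is cached and reused,
    # so the SSL handshake only happens once
    Spidr.site('https://www.example.com/') do |agent|
      agent.every_page do |page|
        puts page.url
      end
    end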

Spidr::Agent#get_page will now correctly send the URI query along with the URI path for HTTP requests. Thanks go out to Damian Steer for reporting this.
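
A quick way to exercise this is to fetch a single URL that carries a query string (a hypothetical URL, purely for illustration):

    require 'spidr'
    require 'uri'

    agent = Spidr::Agent.new

    # the ?q=spidr query string is now sent along with the path
    page = agent.get_page(URI.parse('http://www.example.com/search?q=spidr'))
    puts page.title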

Spidr::Page#doc now returns a Nokogiri::XML::Document object for RSS/RDF/Atom pages, allowing one to properly search RSS/Atom feeds.
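
For instance, one could now pull the item titles out of an RSS feed encountered during a crawl (a rough sketch; it assumes the feed is served with an rss Content-Type and uses plain RSS 2.0 markup):

    agent.every_page do |page|
      if page.content_type.to_s.include?('rss')
        # page.doc is a Nokogiri::XML::Document for feeds
        page.doc.search('//item/title').each do |title|
          puts title.inner_text
        end
      end
    end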

Spidr::Page#code will now return the HTTP status code as an Integer.
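
Since the status code is an Integer, numeric comparisons now work directly:

    agent.every_page do |page|
      # flag anything that came back with a 4xx or 5xx status
      puts "broken: #{page.url} (#{page.code})" if page.code >= 400
    end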

Spidr::Page#links now properly handles the HTTP Location header.

The URI expansion/normalization performed by Spidr::Page#to_absolute was greatly improved; it now properly preserves trailing '/' characters on URI paths.

Shout Outs

A big thanks to everyone who helped with Spidr 0.2.0 by reporting bugs and testing new code. I hope this release will help users get more out of Spidr.