P O S T M O D E R N

Spidr 0.2.2 released.

Spidr 0.2.2 (code-named "next-level") has been released. This release contains many changes that push Spidr to a new level of web spidering.

Web Spider Obstacle Course (WSOC)

Spidr 0.2.2 now requires and makes use of the new Web Spider Obstacle Course (WSOC) for testing. Before one runs the RSpec test-suite for Spidr, the WSOC server must first be started:
$ wsoc_server
Then simply run the specs as usual:
$ rake spec

Cookie support

As of 0.2.2, Spidr comes with a CookieJar, thanks to the work of @zapnap. When the Spidr::Agent visits a page, any new cookie values are merged into the CookieJar and sent back with future requests. Additionally, one can now access the cookie values from a Spidr::Page object:
page.cookie
# => "COUNTRY=USA%2C97.100.45.38; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/; domain=.php.net"
page.cookies
# => ["COUNTRY=USA%2C97.100.45.38; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/; domain=.php.net"]
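
To illustrate the merge-and-resend behaviour described above, here is a minimal sketch of a per-host cookie store. It is not Spidr's actual CookieJar; the class name SimpleCookieJar and its methods merge and header_for are hypothetical, and the sketch ignores cookie attributes such as expires, path and domain.

```ruby
# Hypothetical helper, not Spidr's CookieJar: a minimal per-host cookie store.
class SimpleCookieJar
  def initialize
    # host => { cookie name => cookie value }
    @cookies = Hash.new { |hash, host| hash[host] = {} }
  end

  # Merge the name=value pair of a raw Set-Cookie header into the jar.
  # Attributes (expires, path, domain) are deliberately ignored here.
  def merge(host, set_cookie)
    pair = set_cookie.split(';', 2).first.to_s.strip
    name, value = pair.split('=', 2)
    return if name.nil? || name.empty?

    @cookies[host][name] = value.to_s
  end

  # Build the value of the Cookie request header for a host,
  # or nil when nothing has been stored for it.
  def header_for(host)
    return nil unless @cookies.key?(host)

    @cookies[host].map { |name, value| "#{name}=#{value}" }.join('; ')
  end
end

jar = SimpleCookieJar.new
jar.merge('php.net', 'COUNTRY=USA%2C97.100.45.38; expires=Mon, 18-Jan-2010 06:19:24 GMT; path=/; domain=.php.net')
jar.merge('php.net', 'LANG=en; path=/')
puts jar.header_for('php.net')
# prints "COUNTRY=USA%2C97.100.45.38; LANG=en"
```

A later Set-Cookie with the same name simply overwrites the stored value, which is the "merge" part of the behaviour.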

HTTP Basic Auth support

Spidr 0.2.2 also comes with a brand new AuthStore for organizing HTTP Authentication credentials, again thanks to the work of @zapnap. Provided you have the credentials for the HTTP Basic Auth protected areas that are to be spidered, Spidr can automatically respond to Basic Auth challenges. Simply give the credentials to the Spidr::Agent and the agent will do the rest:
Spidr.host('corporation.com') do |agent|
  agent.authorized.add('http://corporation.com/private/', 'user1233', 'motivate synergize')

  agent.every_page do |page|
    if page.url.path =~ /private/
      # ...
    end
  end
end
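
For readers curious what "responding to a Basic Auth challenge" involves, the following is a rough sketch of the general technique, not Spidr's AuthStore. The names CREDENTIALS, credentials_for and basic_auth_header are hypothetical; the sketch assumes credentials are looked up by URL prefix and then encoded into an Authorization header as HTTP Basic Auth requires.

```ruby
# Hypothetical store, not Spidr's AuthStore: URL prefix => [user, password].
CREDENTIALS = {
  'http://corporation.com/private/' => ['user1233', 'motivate synergize']
}.freeze

# Find the credentials whose URL prefix covers the given URL, if any.
def credentials_for(url)
  _prefix, creds = CREDENTIALS.find { |prefix, _| url.start_with?(prefix) }
  creds
end

# Build the Authorization header value for HTTP Basic Auth:
# "Basic " followed by Base64("user:password"), with no line breaks.
def basic_auth_header(user, password)
  'Basic ' + ["#{user}:#{password}"].pack('m0')
end

url = 'http://corporation.com/private/reports.html'
if (creds = credentials_for(url))
  puts "Authorization: #{basic_auth_header(*creds)}"
end
```

Array#pack('m0') is used so the sketch depends only on core Ruby; a URL outside any stored prefix yields no credentials and therefore no header.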

URL Sanitization

Spidr 0.2.2 adds a small yet important module, Spidr::Sanitizers. The Sanitizers module adds configuration settings to Spidr::Agent that control how incoming URLs are sanitized.

For instance, URL fragments are removed by default, but this can be changed:

agent.strip_fragments
# => true
agent.strip_fragments = false

Additionally, perhaps one might wish to strip the query strings from all URLs:

agent.strip_query = true
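
The effect of these two settings can be approximated with Ruby's standard URI library. This is only a sketch under assumptions, not the code inside Spidr::Sanitizers; the sanitize_url helper and its keyword arguments are hypothetical names chosen to mirror the settings above.

```ruby
require 'uri'

# Hypothetical helper mirroring the strip_fragments / strip_query settings.
# Returns a URI with the fragment and/or query string removed.
def sanitize_url(url, strip_fragments: true, strip_query: false)
  uri = URI.parse(url.to_s)
  uri.fragment = nil if strip_fragments
  uri.query = nil if strip_query
  uri
end

puts sanitize_url('http://example.com/page?id=1#top')
# prints "http://example.com/page?id=1"
puts sanitize_url('http://example.com/page?id=1#top', strip_query: true)
# prints "http://example.com/page"
puts sanitize_url('http://example.com/page?id=1#top', strip_fragments: false)
# prints "http://example.com/page?id=1#top"
```

Stripping fragments keeps a spider from treating page#a and page#b as different pages; stripping query strings is more aggressive and can merge pages that are genuinely distinct.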

Note: If YARD documentation generation fails when installing Spidr 0.2.2, this is due to a bug in RDoc/SimpleMarkup generation.