Module: WaybackArchiver

Defined in:
lib/wayback_archiver.rb,
lib/wayback_archiver/archive.rb,
lib/wayback_archiver/request.rb,
lib/wayback_archiver/sitemap.rb,
lib/wayback_archiver/version.rb,
lib/wayback_archiver/response.rb,
lib/wayback_archiver/http_code.rb,
lib/wayback_archiver/sitemapper.rb,
lib/wayback_archiver/null_logger.rb,
lib/wayback_archiver/thread_pool.rb,
lib/wayback_archiver/url_collector.rb,
lib/wayback_archiver/archive_result.rb,
lib/wayback_archiver/adapters/wayback_machine.rb

Overview

WaybackArchiver sends URLs to the Wayback Machine. The URLs can be collected by crawling a site, read from its sitemap, or passed in as a list.

Defined Under Namespace

Classes: Archive, ArchiveResult, HTTPCode, NullLogger, Request, Response, Sitemap, Sitemapper, ThreadPool, URLCollector, WaybackMachine

Constant Summary

INFO_LINK =

'https://rubygems.org/gems/wayback_archiver'.freeze
USER_AGENT =

WaybackArchiver User-Agent

"WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze
DEFAULT_RESPECT_ROBOTS_TXT =

Default for whether to respect robots.txt files

false
DEFAULT_CONCURRENCY =

Default concurrency for archiving URLs

1
DEFAULT_MAX_LIMIT =

Maximum number of links posted (-1 is no limit)

-1
VERSION =

Gem version

'1.4.0'.freeze

Class Method Details

.adapter ⇒ Object, #call

Returns the configured adapter

Returns:

  • (Object, #call)

    the configured or the default adapter



# File 'lib/wayback_archiver.rb', line 230

def self.adapter
  @adapter ||= WaybackMachine
end

.adapter=(adapter) ⇒ Object, #call

Sets the adapter

Parameters:

  • adapter (Object, #call)

    the adapter

Returns:

  • (Object, #call)

    the configured adapter



# File 'lib/wayback_archiver.rb', line 220

def self.adapter=(adapter)
  unless adapter.respond_to?(:call)
    raise(ArgumentError, 'adapter must implement #call')
  end

  @adapter = adapter
end
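
The setter only checks that the adapter responds to #call, so a lambda or any object with a call method qualifies. The sketch below reproduces that check in a stand-in module. AdapterConfig is a hypothetical name, not part of the gem, and the idea that the adapter receives a URL string is an assumption that this page does not confirm.

```ruby
# Stand-in for the setter shown above; AdapterConfig is a hypothetical name.
module AdapterConfig
  def self.adapter=(adapter)
    unless adapter.respond_to?(:call)
      raise(ArgumentError, 'adapter must implement #call')
    end

    @adapter = adapter
  end

  def self.adapter
    @adapter
  end
end

# A lambda responds to #call, so it is accepted.
# Assumption: the adapter is called with a URL string.
AdapterConfig.adapter = ->(url) { "archived #{url}" }
accepted = AdapterConfig.adapter.call('example.com')

# A plain String does not respond to #call, so it is rejected.
rejected = begin
  AdapterConfig.adapter = 'not callable'
  false
rescue ArgumentError
  true
end
```

With the real module, the same duck-typing rule should apply to WaybackArchiver.adapter=.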

.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URLs to Wayback Machine.

Examples:

Look for Sitemap(s) on example.com and fall back to crawling (the default :auto strategy)

WaybackArchiver.archive('example.com') # Default strategy is :auto
WaybackArchiver.archive('example.com', strategy: :auto)
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :auto, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :auto)

Crawl example.com and send all URLs of the same domain

WaybackArchiver.archive('example.com', strategy: :crawl)
WaybackArchiver.archive('example.com', strategy: :crawl, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :crawl, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :crawl)

Send example.com Sitemap URLs

WaybackArchiver.archive('example.com', strategy: :sitemap)
WaybackArchiver.archive('example.com', strategy: :sitemap, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :sitemap, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :sitemap)

Send only example.com

WaybackArchiver.archive('example.com', strategy: :url)
WaybackArchiver.archive('example.com', strategy: :url, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :url, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :url)

Crawl multiple hosts

WaybackArchiver.archive(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)

Parameters:

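  • source (String/Array<String>)

    the URL or URLs to archive.

  • strategy (String/Symbol) (defaults to: :auto)

    how to interpret source. Supported strategies: crawl, sitemap, url, urls, auto.

  • hosts (Array<String, Regexp>) (defaults to: [])

    the hosts to crawl (used by the crawl strategy).

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

  • limit (Integer) (defaults to: WaybackArchiver.max_limit)

    maximum number of URLs to send (-1 is no limit).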

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 57

def self.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  strategy = legacy_strategy || strategy

  case strategy.to_s
  when 'crawl'   then crawl(source, concurrency: concurrency, limit: limit, hosts: hosts, &block)
  when 'auto'    then auto(source, concurrency: concurrency, limit: limit, &block)
  when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
  when 'urls'    then urls(source, concurrency: concurrency, limit: limit, &block)
  when 'url'     then urls(source, concurrency: concurrency, limit: limit, &block)
  else
    raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl"
  end
end
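
Two details of the dispatch above are easy to miss: a positional legacy_strategy takes precedence over the strategy keyword, and the value is compared after to_s, so Strings and Symbols are interchangeable. The helper below is a hypothetical illustration of just that resolution step and does not call the gem.

```ruby
# Hypothetical helper mirroring the first lines of .archive.
def resolve_strategy(legacy_strategy = nil, strategy: :auto)
  known = %w[crawl auto sitemap urls url]
  name = (legacy_strategy || strategy).to_s
  raise ArgumentError, "Unknown strategy: '#{name}'" unless known.include?(name)

  name
end

default_name  = resolve_strategy()                          # keyword default
string_name   = resolve_strategy(strategy: 'sitemap')       # String works too
legacy_name   = resolve_strategy(:crawl, strategy: :sitemap) # positional wins

unknown_rejected = begin
  resolve_strategy(strategy: :ftp)
  false
rescue ArgumentError
  true
end
```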

.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Look for Sitemap(s) and, if none are found, fall back to crawling. Then send the found URLs to the Wayback Machine.

Examples:

Auto archive example.com

WaybackArchiver.auto('example.com') # Default concurrency is 1

Auto archive example.com with low concurrency

WaybackArchiver.auto('example.com', concurrency: 1)

Auto archive example.com and archive max 100 URLs

WaybackArchiver.auto('example.com', limit: 100)

Parameters:

  • source (String)

    (must be a valid URL).

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.

# File 'lib/wayback_archiver.rb', line 83

def self.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  urls = Sitemapper.autodiscover(source)
  return urls(urls, concurrency: concurrency, &block) if urls.any?

  crawl(source, concurrency: concurrency, &block)
end
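
The sitemap-first, crawl-as-fallback flow can be sketched without network access by injecting the two steps as callables. collect_urls and its discover: and crawl: parameters are hypothetical names for illustration; in the gem these roles are played by Sitemapper.autodiscover and crawl.

```ruby
# Hypothetical stand-in: discover and crawl are injected callables.
def collect_urls(source, discover:, crawl:)
  urls = discover.call(source)
  # Use the sitemap URLs when any were found.
  return [:sitemap, urls] if urls.any?

  # Otherwise fall back to crawling the source.
  [:crawl, crawl.call(source)]
end

with_sitemap = collect_urls(
  'example.com',
  discover: ->(_src) { ['http://example.com/a'] },
  crawl: ->(_src) { raise 'crawl should not run' }
)

without_sitemap = collect_urls(
  'example.com',
  discover: ->(_src) { [] },
  crawl: ->(src) { ["http://#{src}/"] }
)
```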

.concurrency ⇒ Integer

Returns the default concurrency

Returns:

  • (Integer)

    the configured or the default concurrency



# File 'lib/wayback_archiver.rb', line 200

def self.concurrency
  @concurrency ||= DEFAULT_CONCURRENCY
end

.concurrency=(concurrency) ⇒ Integer

Sets the default concurrency

Parameters:

  • concurrency (Integer)

    the desired default concurrency

Returns:

  • (Integer)

    the desired default concurrency



# File 'lib/wayback_archiver.rb', line 194

def self.concurrency=(concurrency)
  @concurrency = concurrency
end

.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Crawl site for URLs to send to the Wayback Machine.

Examples:

Crawl example.com and send all URLs of the same domain

WaybackArchiver.crawl('example.com') # Default concurrency is 1

Crawl example.com and send all URLs of the same domain with low concurrency

WaybackArchiver.crawl('example.com', concurrency: 1)

Crawl example.com and archive max 100 URLs

WaybackArchiver.crawl('example.com', limit: 100)

Crawl multiple hosts

WaybackArchiver.crawl(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)

Parameters:

  • url (String)

    to start crawling from.

  • hosts (Array<String, Regexp>) (defaults to: [])

    to crawl

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 109

def self.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Crawling #{url}"
  Archive.crawl(url, hosts: hosts, concurrency: concurrency, limit: limit, &block)
end

.default_logger! ⇒ NullLogger

Resets the logger to the default

Returns:

  • (NullLogger)

    the new default logger

# File 'lib/wayback_archiver.rb', line 161

def self.default_logger!
  @logger = NullLogger.new
end

.logger ⇒ Object

Returns the current logger

Returns:

  • (Object)

    the current logger instance



# File 'lib/wayback_archiver.rb', line 155

def self.logger
  @logger ||= NullLogger.new
end

.logger=(logger) ⇒ Object

Sets the logger

Examples:

Set a logger that prints to standard output (STDOUT)

WaybackArchiver.logger = Logger.new(STDOUT)

Parameters:

  • logger (Object)

    an object that quacks like a Logger

Returns:

  • (Object)

    the set logger



# File 'lib/wayback_archiver.rb', line 149

def self.logger=(logger)
  @logger = logger
end
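
Because any object that quacks like a Logger is accepted, a standard-library Logger writing to a StringIO is one way to capture messages, for example in tests. The sketch below does not call WaybackArchiver; the logged text only imitates the "Crawling ..." message emitted by .crawl, and the custom formatter is an illustrative choice.

```ruby
require 'logger'
require 'stringio'

# Collect log output in memory instead of printing it.
io = StringIO.new
logger = Logger.new(io)

# Drop timestamps so the captured output is easy to compare.
logger.formatter = proc do |severity, _time, _progname, msg|
  "#{severity}: #{msg}\n"
end

# With the gem loaded this object could be assigned via
# WaybackArchiver.logger = logger
logger.info 'Crawling example.com'

captured = io.string
```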

.max_limit ⇒ Integer

Returns the default max_limit

Returns:

  • (Integer)

    the configured or the default max_limit



# File 'lib/wayback_archiver.rb', line 213

def self.max_limit
  @max_limit ||= DEFAULT_MAX_LIMIT
end

.max_limit=(max_limit) ⇒ Integer

Sets the default max_limit

Parameters:

  • max_limit (Integer)

    the desired default max_limit

Returns:

  • (Integer)

    the desired default max_limit



# File 'lib/wayback_archiver.rb', line 207

def self.max_limit=(max_limit)
  @max_limit = max_limit
end

.respect_robots_txt ⇒ Boolean

Returns the default respect_robots_txt

Returns:

  • (Boolean)

    the configured or the default respect_robots_txt



# File 'lib/wayback_archiver.rb', line 187

def self.respect_robots_txt
  @respect_robots_txt ||= DEFAULT_RESPECT_ROBOTS_TXT
end
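
The ||= operator assigns when the current value is nil or false, so a stored false is treated the same as an unset value. With a default of false, as here, the getter returns false either way. The sketch below uses a hypothetical default of true only to make the difference visible, and shows an explicit nil check as one possible alternative.

```ruby
# Hypothetical default, chosen so the effect of ||= is observable.
default_flag = true

# ||= replaces an explicit false with the default.
stored = false
stored ||= default_flag

# A nil check only applies the default when nothing was set.
explicit = false
explicit = default_flag if explicit.nil?

unset = nil
unset = default_flag if unset.nil?
```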

.respect_robots_txt=(respect_robots_txt) ⇒ Boolean

Sets the default respect_robots_txt

Parameters:

  • respect_robots_txt (Boolean)

    the desired default

Returns:

  • (Boolean)

    the desired default for respect_robots_txt



# File 'lib/wayback_archiver.rb', line 181

def self.respect_robots_txt=(respect_robots_txt)
  @respect_robots_txt = respect_robots_txt
end

.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Get URLs from sitemap and send found URLs to the Wayback Machine.

Examples:

Get example.com sitemap and archive all found URLs

WaybackArchiver.sitemap('example.com/sitemap.xml') # Default concurrency is 1

Get example.com sitemap and archive all found URLs with low concurrency

WaybackArchiver.sitemap('example.com/sitemap.xml', concurrency: 1)

Get example.com sitemap and archive max 100 URLs

WaybackArchiver.sitemap('example.com/sitemap.xml', limit: 100)

Parameters:

  • url (String)

    to the sitemap.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.

# File 'lib/wayback_archiver.rb', line 125

def self.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Fetching Sitemap"
  Archive.post(URLCollector.sitemap(url), concurrency: concurrency, limit: limit, &block)
end

.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URL(s) to the Wayback Machine.

Examples:

Archive example.com

WaybackArchiver.urls('example.com')

Archive example.com and google.com

WaybackArchiver.urls(%w(example.com google.com))

Archive example.com, max 100 URLs

WaybackArchiver.urls(%w(example.com www.example.com), limit: 100)

Parameters:

  • urls (Array<String>/String)

    or url.

  • concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

  • (Array<ArchiveResult>)

    of URLs sent to the Wayback Machine.



# File 'lib/wayback_archiver.rb', line 140

def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  Archive.post(Array(urls), concurrency: concurrency, &block)
end
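
The method accepts either a single URL or a list because it wraps its argument with Kernel#Array before posting. A short illustration of that conversion on its own:

```ruby
# A single String becomes a one-element Array.
single = Array('example.com')

# An Array is returned unchanged.
many = Array(%w[example.com google.com])

# nil becomes an empty Array.
none = Array(nil)
```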

.user_agent ⇒ String

Returns the configured user agent

Returns:

  • (String)

    the configured or the default user agent



# File 'lib/wayback_archiver.rb', line 174

def self.user_agent
  @user_agent ||= USER_AGENT
end

.user_agent=(user_agent) ⇒ String

Sets the user agent

Parameters:

  • user_agent (String)

    the desired user agent

Returns:

  • (String)

    the configured user agent



# File 'lib/wayback_archiver.rb', line 168

def self.user_agent=(user_agent)
  @user_agent = user_agent
end