Module: WaybackArchiver

Defined in:: lib/wayback_archiver.rb,
lib/wayback_archiver/archive.rb,
lib/wayback_archiver/request.rb,
lib/wayback_archiver/sitemap.rb,
lib/wayback_archiver/version.rb,
lib/wayback_archiver/response.rb,
lib/wayback_archiver/http_code.rb,
lib/wayback_archiver/sitemapper.rb,
lib/wayback_archiver/null_logger.rb,
lib/wayback_archiver/thread_pool.rb,
lib/wayback_archiver/url_collector.rb,
lib/wayback_archiver/archive_result.rb,
lib/wayback_archiver/adapters/wayback_machine.rb

Overview

WaybackArchiver, send URLs to Wayback Machine. By crawling, sitemap or by passing a list of URLs.

Defined Under Namespace

Classes: Archive, ArchiveResult, HTTPCode, NullLogger, Request, Response, Sitemap, Sitemapper, ThreadPool, URLCollector, WaybackMachine

Constant Summary collapse

INFO_LINK = Link to gem on rubygems.org, part of the sent User-Agent

'https://rubygems.org/gems/wayback_archiver'.freeze

USER_AGENT = WaybackArchiver User-Agent

"WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze

DEFAULT_RESPECT_ROBOTS_TXT = Default for whether to respect robots txt files

false

DEFAULT_CONCURRENCY = Default concurrency for archiving URLs

DEFAULT_MAX_LIMIT = Maxmium number of links posted (-1 is no limit)

-1

VERSION = Gem version

'1.5.0'.freeze

Class Method Summary collapse

.adapter ⇒ Integer

Returns the configured adapter.
.adapter=(adapter) ⇒ Object, #call

Sets the adapter.
.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URLs to Wayback Machine.
.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Look for Sitemap(s) and if nothing is found fallback to crawling.
.concurrency ⇒ Integer

Returns the default concurrency.
.concurrency=(concurrency) ⇒ Integer

Sets the default concurrency.
.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Crawl site for URLs to send to the Wayback Machine.
.default_logger! ⇒ NullLogger

Resets the logger to the default.
.logger ⇒ Object

Returns the current logger.
.logger=(logger) ⇒ Object

Set logger.
.max_limit ⇒ Integer

Returns the default max_limit.
.max_limit=(max_limit) ⇒ Integer

Sets the default max_limit.
.respect_robots_txt ⇒ Boolean

Returns the default respect_robots_txt.
.respect_robots_txt=(respect_robots_txt) ⇒ Boolean

Sets the default respect_robots_txt.
.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Get URLs from sitemap and send found URLs to the Wayback Machine.
.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Send URL to the Wayback Machine.
.user_agent ⇒ String

Returns the configured user agent.
.user_agent=(user_agent) ⇒ String

Sets the user agent.

Class Method Details

.adapter ⇒ `Integer`

Returns the configured adapter

Returns:

(Integer) —

the configured or the default adapter



230
231
232

# File 'lib/wayback_archiver.rb', line 230

def self.adapter
  @adapter ||= WaybackMachine
end

.adapter=(adapter) ⇒ `Object`, `#call`

Sets the adapter

Parameters:

] (Object, #call) —

the adapter

Returns:

(Object, #call) —

] the configured adapter

# File 'lib/wayback_archiver.rb', line 220

def self.adapter=(adapter)
  unless adapter.respond_to?(:call)
    raise(ArgumentError, 'adapter must implement #call')
  end

  @adapter = adapter
end

.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

Send URLs to Wayback Machine.

Examples:

Crawl example.com and send all URLs of the same domain

WaybackArchiver.archive('example.com') # Default strategy is :auto
WaybackArchiver.archive('example.com', strategy: :auto)
WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :auto, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :auto)

Crawl example.com and send all URLs of the same domain

WaybackArchiver.archive('example.com', strategy: :crawl)
WaybackArchiver.archive('example.com', strategy: :crawl, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :crawl, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :crawl)

Send example.com Sitemap URLs

WaybackArchiver.archive('example.com', strategy: :sitemap)
WaybackArchiver.archive('example.com', strategy: :sitemap, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :sitemap, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :sitemap)

Send only example.com

WaybackArchiver.archive('example.com', strategy: :url)
WaybackArchiver.archive('example.com', strategy: :url, concurrency: 10)
WaybackArchiver.archive('example.com', strategy: :url, limit: 100) # send max 100 URLs
WaybackArchiver.archive('example.com', :url)

Crawl multiple hosts

WaybackArchiver.archive(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)

Parameters:

source (String/Array<String>) —

for URL(s).
strategy (String/Symbol) (defaults to: :auto) —

of source. Supported strategies: crawl, sitemap, url, urls, auto.
hosts (Array<String, Regexp>) (defaults to: []) —

to crawl.

Returns:

(Array<ArchiveResult>) —

of URLs sent to the Wayback Machine.

# File 'lib/wayback_archiver.rb', line 57

def self.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  strategy = legacy_strategy || strategy

  case strategy.to_s
  when 'crawl'   then crawl(source, concurrency: concurrency, limit: limit, hosts: hosts, &block)
  when 'auto'    then auto(source, concurrency: concurrency, limit: limit, &block)
  when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
  when 'urls'    then urls(source, concurrency: concurrency, limit: limit, &block)
  when 'url'     then urls(source, concurrency: concurrency, limit: limit, &block)
  else
    raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl"
  end
end

.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

Look for Sitemap(s) and if nothing is found fallback to crawling. Then send found URLs to the Wayback Machine.

Examples:

Auto archive example.com

WaybackArchiver.auto('example.com') # Default concurrency is 1

Auto archive example.com with low concurrency

WaybackArchiver.auto('example.com', concurrency: 1)

Auto archive example.com and archive max 100 URLs

WaybackArchiver.auto('example.com', limit: 100)

Parameters:

source (String) —

(must be a valid URL).
concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

(Array<ArchiveResult>) —

of URLs sent to the Wayback Machine.

.concurrency ⇒ `Integer`

Returns the default concurrency

Returns:

(Integer) —

the configured or the default concurrency



200
201
202

# File 'lib/wayback_archiver.rb', line 200

def self.concurrency
  @concurrency ||= DEFAULT_CONCURRENCY
end

.concurrency=(concurrency) ⇒ `Integer`

Sets the default concurrency

Parameters:

concurrency (Integer) —

the desired default concurrency

Returns:

(Integer) —

the desired default concurrency



194
195
196

# File 'lib/wayback_archiver.rb', line 194

def self.concurrency=(concurrency)
  @concurrency = concurrency
end

.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

Crawl site for URLs to send to the Wayback Machine.

Examples:

Crawl example.com and send all URLs of the same domain

WaybackArchiver.crawl('example.com') # Default concurrency is 1

Crawl example.com and send all URLs of the same domain with low concurrency

WaybackArchiver.crawl('example.com', concurrency: 1)

Crawl example.com and archive max 100 URLs

WaybackArchiver.crawl('example.com', limit: 100)

Crawl multiple hosts

URLCollector.crawl(
  'http://example.com',
  hosts: [
    'example.com',
    /host[\d]+\.example\.com/
  ]
)

Parameters:

url (String) —

to start crawling from.
hosts (Array<String, Regexp>) (defaults to: []) —

to crawl
concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

(Array<ArchiveResult>) —

of URLs sent to the Wayback Machine.

# File 'lib/wayback_archiver.rb', line 109

def self.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  WaybackArchiver.logger.info "Crawling #{url}"
  Archive.crawl(url, hosts: hosts, concurrency: concurrency, limit: limit, &block)
end

.default_logger! ⇒ `NullLogger`

Resets the logger to the default

Returns:

(NullLogger) —

a new instance of NullLogger



161
162
163

# File 'lib/wayback_archiver.rb', line 161

def self.default_logger!
  @logger = NullLogger.new
end

.logger ⇒ `Object`

Returns the current logger

Returns:

(Object) —

the current logger instance



155
156
157

# File 'lib/wayback_archiver.rb', line 155

def self.logger
  @logger ||= NullLogger.new
end

.logger=(logger) ⇒ `Object`

Set logger

Examples:

set a logger that prints to standard out (STDOUT)

WaybackArchiver.logger = Logger.new(STDOUT)

Parameters:

logger (Object) —

an object than response to quacks like a Logger

Returns:

(Object) —

the set logger



149
150
151

# File 'lib/wayback_archiver.rb', line 149

def self.logger=(logger)
  @logger = logger
end

.max_limit ⇒ `Integer`

Returns the default max_limit

Returns:

(Integer) —

the configured or the default max_limit



213
214
215

# File 'lib/wayback_archiver.rb', line 213

def self.max_limit
  @max_limit ||= DEFAULT_MAX_LIMIT
end

.max_limit=(max_limit) ⇒ `Integer`

Sets the default max_limit

Parameters:

max_limit (Integer) —

the desired default max_limit

Returns:

(Integer) —

the desired default max_limit



207
208
209

# File 'lib/wayback_archiver.rb', line 207

def self.max_limit=(max_limit)
  @max_limit = max_limit
end

.respect_robots_txt ⇒ `Boolean`

Returns the default respect_robots_txt

Returns:

(Boolean) —

the configured or the default respect_robots_txt



187
188
189

# File 'lib/wayback_archiver.rb', line 187

def self.respect_robots_txt
  @respect_robots_txt ||= DEFAULT_RESPECT_ROBOTS_TXT
end

.respect_robots_txt=(respect_robots_txt) ⇒ `Boolean`

Sets the default respect_robots_txt

Parameters:

respect_robots_txt (Boolean) —

the desired default

Returns:

(Boolean) —

the desired default for respect_robots_txt



181
182
183

# File 'lib/wayback_archiver.rb', line 181

def self.respect_robots_txt=(respect_robots_txt)
  @respect_robots_txt = respect_robots_txt
end

.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

Get URLs from sitemap and send found URLs to the Wayback Machine.

Examples:

Get example.com sitemap and archive all found URLs

WaybackArchiver.sitemap('example.com/sitemap.xml') # Default concurrency is 1

Get example.com sitemap and archive all found URLs with low concurrency

WaybackArchiver.sitemap('example.com/sitemap.xml', concurrency: 1)

Get example.com sitemap archive max 100 URLs

WaybackArchiver.sitemap('example.com/sitemap.xml', limit: 100)

Parameters:

url (String) —

to the sitemap.
concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

(Array<ArchiveResult>) —

of URLs sent to the Wayback Machine.

.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

Send URL to the Wayback Machine.

Examples:

Archive example.com

WaybackArchiver.urls('example.com')

Archive example.com and google.com

WaybackArchiver.urls(%w(example.com google.com))

Archive example.com, max 100 URLs

WaybackArchiver.urls(%w(example.com www.example.com), limit: 100)

Parameters:

urls (Array<String>/String) —

or url.
concurrency (Integer) (defaults to: WaybackArchiver.concurrency)

Returns:

(Array<ArchiveResult>) —

of URLs sent to the Wayback Machine.



140
141
142

# File 'lib/wayback_archiver.rb', line 140

def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
  Archive.post(Array(urls), concurrency: concurrency, &block)
end

.user_agent ⇒ `String`

Returns the configured user agent

Returns:

(String) —

the configured or the default user agent



174
175
176

# File 'lib/wayback_archiver.rb', line 174

def self.user_agent
  @user_agent ||= USER_AGENT
end

.user_agent=(user_agent) ⇒ `String`

Sets the user agent

Parameters:

user_agent (String) —

the desired user agent

Returns:

(String) —

the configured user agent



168
169
170

# File 'lib/wayback_archiver.rb', line 168

def self.user_agent=(user_agent)
  @user_agent = user_agent
end

Module: WaybackArchiver

Overview

Defined Under Namespace

Constant Summary collapse

Class Method Summary collapse

Class Method Details

.adapter ⇒ Integer

.adapter=(adapter) ⇒ Object, #call

.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Examples:

Crawl example.com and send all URLs of the same domain

Crawl example.com and send all URLs of the same domain

Send example.com Sitemap URLs

Send only example.com

Crawl multiple hosts

.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Examples:

Auto archive example.com

Auto archive example.com with low concurrency

Auto archive example.com and archive max 100 URLs

.concurrency ⇒ Integer

.concurrency=(concurrency) ⇒ Integer

.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Examples:

Crawl example.com and send all URLs of the same domain

Crawl example.com and send all URLs of the same domain with low concurrency

Crawl example.com and archive max 100 URLs

Crawl multiple hosts

.default_logger! ⇒ NullLogger

.logger ⇒ Object

.logger=(logger) ⇒ Object

Examples:

set a logger that prints to standard out (STDOUT)

.max_limit ⇒ Integer

.max_limit=(max_limit) ⇒ Integer

.respect_robots_txt ⇒ Boolean

.respect_robots_txt=(respect_robots_txt) ⇒ Boolean

.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Examples:

Get example.com sitemap and archive all found URLs

Get example.com sitemap and archive all found URLs with low concurrency

Get example.com sitemap archive max 100 URLs

.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>

Examples:

Archive example.com

Archive example.com and google.com

Archive example.com, max 100 URLs

.user_agent ⇒ String

.user_agent=(user_agent) ⇒ String

.adapter ⇒ `Integer`

.adapter=(adapter) ⇒ `Object`, `#call`

.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

.concurrency ⇒ `Integer`

.concurrency=(concurrency) ⇒ `Integer`

.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

.default_logger! ⇒ `NullLogger`

.logger ⇒ `Object`

.logger=(logger) ⇒ `Object`

.max_limit ⇒ `Integer`

.max_limit=(max_limit) ⇒ `Integer`

.respect_robots_txt ⇒ `Boolean`

.respect_robots_txt=(respect_robots_txt) ⇒ `Boolean`

.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ `Array<ArchiveResult>`

.user_agent ⇒ `String`

.user_agent=(user_agent) ⇒ `String`