Module: WaybackArchiver
- Defined in:
- lib/wayback_archiver.rb,
lib/wayback_archiver/archive.rb,
lib/wayback_archiver/request.rb,
lib/wayback_archiver/sitemap.rb,
lib/wayback_archiver/version.rb,
lib/wayback_archiver/response.rb,
lib/wayback_archiver/http_code.rb,
lib/wayback_archiver/sitemapper.rb,
lib/wayback_archiver/null_logger.rb,
lib/wayback_archiver/thread_pool.rb,
lib/wayback_archiver/url_collector.rb,
lib/wayback_archiver/archive_result.rb,
lib/wayback_archiver/adapters/wayback_machine.rb
Overview
WaybackArchiver sends URLs to the Wayback Machine. URLs can be collected by crawling a site, by reading its sitemap, or by passing a list of URLs directly.
Defined Under Namespace
Classes: Archive, ArchiveResult, HTTPCode, NullLogger, Request, Response, Sitemap, Sitemapper, ThreadPool, URLCollector, WaybackMachine
Constant Summary
- INFO_LINK =
Link to gem on rubygems.org, part of the sent User-Agent
'https://rubygems.org/gems/wayback_archiver'.freeze
- USER_AGENT =
WaybackArchiver User-Agent
"WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze
- DEFAULT_RESPECT_ROBOTS_TXT =
Default for whether to respect robots txt files
false
- DEFAULT_CONCURRENCY =
Default concurrency for archiving URLs
1
- DEFAULT_MAX_LIMIT =
Maximum number of links posted (-1 is no limit)
-1
- VERSION =
Gem version
'1.4.0'.freeze
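As a quick check of how the string constants compose, the sketch below rebuilds the default User-Agent from the values listed above. `UserAgentSketch` is a hypothetical stand-in module and is not part of the gem; only the constant values are taken from this page.

```ruby
# Hypothetical stand-in module; the constant values are copied from the
# Constant Summary on this page.
module UserAgentSketch
  VERSION    = '1.4.0'.freeze
  INFO_LINK  = 'https://rubygems.org/gems/wayback_archiver'.freeze
  USER_AGENT = "WaybackArchiver/#{VERSION} (+#{INFO_LINK})".freeze
end

puts UserAgentSketch::USER_AGENT
# prints: WaybackArchiver/1.4.0 (+https://rubygems.org/gems/wayback_archiver)
```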
Class Method Summary
-
.adapter ⇒ Object, #call
Returns the configured adapter.
-
.adapter=(adapter) ⇒ Object, #call
Sets the adapter.
-
.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send URLs to Wayback Machine.
-
.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Look for Sitemap(s) and, if none are found, fall back to crawling.
-
.concurrency ⇒ Integer
Returns the default concurrency.
-
.concurrency=(concurrency) ⇒ Integer
Sets the default concurrency.
-
.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Crawl site for URLs to send to the Wayback Machine.
-
.default_logger! ⇒ NullLogger
Resets the logger to the default.
-
.logger ⇒ Object
Returns the current logger.
-
.logger=(logger) ⇒ Object
Sets the logger.
-
.max_limit ⇒ Integer
Returns the default max_limit.
-
.max_limit=(max_limit) ⇒ Integer
Sets the default max_limit.
-
.respect_robots_txt ⇒ Boolean
Returns the default respect_robots_txt.
-
.respect_robots_txt=(respect_robots_txt) ⇒ Boolean
Sets the default respect_robots_txt.
-
.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Get URLs from sitemap and send found URLs to the Wayback Machine.
-
.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send one or more URLs to the Wayback Machine.
-
.user_agent ⇒ String
Returns the configured user agent.
-
.user_agent=(user_agent) ⇒ String
Sets the user agent.
Class Method Details
.adapter ⇒ Object, #call
Returns the configured adapter (WaybackMachine unless another adapter has been set)

    # File 'lib/wayback_archiver.rb', line 230
    def self.adapter
      @adapter ||= WaybackMachine
    end
.adapter=(adapter) ⇒ Object, #call
Sets the adapter

    # File 'lib/wayback_archiver.rb', line 220
    def self.adapter=(adapter)
      unless adapter.respond_to?(:call)
        raise(ArgumentError, 'adapter must implement #call')
      end

      @adapter = adapter
    end
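The setter above only checks that the adapter responds to `#call`. The following is a minimal sketch of that contract using a hypothetical stand-in module (`AdapterSketch`) that mirrors the getter and setter; it does not load the gem. That the adapter is called with a single URL is an assumption made for illustration, as is the placeholder default.

```ruby
# Hypothetical stand-in that mirrors the adapter getter/setter shown above.
module AdapterSketch
  # Placeholder default; the gem itself defaults to its WaybackMachine class.
  DEFAULT_ADAPTER = ->(url) { "default:#{url}" }

  def self.adapter=(adapter)
    unless adapter.respond_to?(:call)
      raise(ArgumentError, 'adapter must implement #call')
    end

    @adapter = adapter
  end

  def self.adapter
    @adapter ||= DEFAULT_ADAPTER
  end
end

# A lambda responds to #call, so it is accepted.
# Assumption: the adapter receives one URL argument.
AdapterSketch.adapter = ->(url) { "archived:#{url}" }
puts AdapterSketch.adapter.call('https://example.com')

# A plain String does not respond to #call, so it is rejected.
begin
  AdapterSketch.adapter = 'not callable'
rescue ArgumentError => e
  puts e.message
end
```

Any object with a `call` method (a lambda, a `Method` object, or an instance of a class defining `call`) would pass the same check.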
.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send URLs to Wayback Machine.

    # File 'lib/wayback_archiver.rb', line 57
    def self.archive(source, legacy_strategy = nil, strategy: :auto, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
      strategy = legacy_strategy || strategy

      case strategy.to_s
      when 'crawl' then crawl(source, concurrency: concurrency, limit: limit, hosts: hosts, &block)
      when 'auto' then auto(source, concurrency: concurrency, limit: limit, &block)
      when 'sitemap' then sitemap(source, concurrency: concurrency, limit: limit, &block)
      when 'urls' then urls(source, concurrency: concurrency, limit: limit, &block)
      when 'url' then urls(source, concurrency: concurrency, limit: limit, &block)
      else
        raise ArgumentError, "Unknown strategy: '#{strategy}'. Allowed strategies: sitemap, urls, url, crawl"
      end
    end
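The dispatch in `.archive` can be sketched in isolation. `resolve_strategy` below is a hypothetical helper, not part of the gem: it returns the name of the method the case statement above would select, and shows that a positional `legacy_strategy` takes precedence over the `strategy:` keyword. Note that the error message in the listing does not mention `auto`, although `auto` is handled and is the default.

```ruby
# Hypothetical helper: returns the name of the method that .archive would
# dispatch to, following the case statement in the listing above.
def resolve_strategy(legacy_strategy = nil, strategy: :auto)
  chosen = legacy_strategy || strategy

  case chosen.to_s
  when 'crawl'       then :crawl
  when 'auto'        then :auto
  when 'sitemap'     then :sitemap
  when 'urls', 'url' then :urls
  else raise ArgumentError, "Unknown strategy: '#{chosen}'"
  end
end

p resolve_strategy                           # => :auto
p resolve_strategy(strategy: 'sitemap')      # => :sitemap
p resolve_strategy(:crawl, strategy: :urls)  # => :crawl (positional wins)
p resolve_strategy(nil, strategy: :url)      # => :urls
```

Because the value is compared after `to_s`, symbols and strings are interchangeable.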
.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Look for Sitemap(s) and, if none are found, fall back to crawling. Then send the found URLs to the Wayback Machine.

    # File 'lib/wayback_archiver.rb', line 83
    def self.auto(source, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
      urls = Sitemapper.autodiscover(source)
      return urls(urls, concurrency: concurrency, &block) if urls.any?

      crawl(source, concurrency: concurrency, &block)
    end
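The sitemap-first, crawl-as-fallback flow can be sketched without any network access. In the example below, `collect_urls` is a hypothetical helper, and the `discover` and `crawl` lambdas are stand-ins for `Sitemapper.autodiscover` and the crawler; their return values are made up for illustration.

```ruby
# Hypothetical sketch of the flow in .auto: try sitemap discovery first and
# fall back to crawling only when discovery yields no URLs.
# `discover` and `crawl` are injected stand-ins; neither touches the network.
def collect_urls(source, discover:, crawl:)
  urls = discover.call(source)
  return [:sitemap, urls] if urls.any?

  [:crawl, crawl.call(source)]
end

crawl           = ->(source) { ["#{source}/", "#{source}/about"] }
with_sitemap    = ->(source) { ["#{source}/page-1"] }
without_sitemap = ->(_source) { [] }

p collect_urls('https://example.com', discover: with_sitemap, crawl: crawl)
p collect_urls('https://example.com', discover: without_sitemap, crawl: crawl)
```

In the listing above, `limit` is accepted by `.auto` but is not passed on to either `urls` or `crawl`, so in this version the limit appears to have no effect when the `auto` strategy is used.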
.concurrency ⇒ Integer
Returns the default concurrency

    # File 'lib/wayback_archiver.rb', line 200
    def self.concurrency
      @concurrency ||= DEFAULT_CONCURRENCY
    end
.concurrency=(concurrency) ⇒ Integer
Sets the default concurrency

    # File 'lib/wayback_archiver.rb', line 194
    def self.concurrency=(concurrency)
      @concurrency = concurrency
    end
.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Crawl site for URLs to send to the Wayback Machine.

    # File 'lib/wayback_archiver.rb', line 109
    def self.crawl(url, hosts: [], concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
      WaybackArchiver.logger.info "Crawling #{url}"
      Archive.crawl(url, hosts: hosts, concurrency: concurrency, limit: limit, &block)
    end
.default_logger! ⇒ NullLogger
Resets the logger to the default

    # File 'lib/wayback_archiver.rb', line 161
    def self.default_logger!
      @logger = NullLogger.new
    end
.logger ⇒ Object
Returns the current logger

    # File 'lib/wayback_archiver.rb', line 155
    def self.logger
      @logger ||= NullLogger.new
    end
.logger=(logger) ⇒ Object
Sets the logger

    # File 'lib/wayback_archiver.rb', line 149
    def self.logger=(logger)
      @logger = logger
    end
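The three logger methods work together: `logger=` swaps in any logger, `logger` lazily falls back to a null logger, and `default_logger!` restores that null logger. A minimal sketch follows, using a hypothetical stand-in module (`LoggerSketch`) and Ruby's standard `Logger`. The `NullLogger` defined here is an assumed minimal implementation that swallows log calls; the gem's own class may differ.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in mirroring the three logger methods shown above.
module LoggerSketch
  # Assumed minimal null logger: accepts the usual log calls, does nothing.
  class NullLogger
    %i[debug info warn error fatal].each do |level|
      define_method(level) { |*_args| nil }
    end
  end

  def self.logger=(logger)
    @logger = logger
  end

  def self.logger
    @logger ||= NullLogger.new
  end

  def self.default_logger!
    @logger = NullLogger.new
  end
end

# Route log output into an in-memory buffer via the standard Logger.
buffer = StringIO.new
LoggerSketch.logger = Logger.new(buffer)
LoggerSketch.logger.info('Crawling https://example.com')
puts buffer.string

# After resetting, messages are silently discarded.
LoggerSketch.default_logger!
LoggerSketch.logger.info('this message is dropped')
```

With the gem itself, the equivalent would presumably be assigning a `Logger` instance through `WaybackArchiver.logger=`.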
.max_limit ⇒ Integer
Returns the default max_limit

    # File 'lib/wayback_archiver.rb', line 213
    def self.max_limit
      @max_limit ||= DEFAULT_MAX_LIMIT
    end
.max_limit=(max_limit) ⇒ Integer
Sets the default max_limit

    # File 'lib/wayback_archiver.rb', line 207
    def self.max_limit=(max_limit)
      @max_limit = max_limit
    end
.respect_robots_txt ⇒ Boolean
Returns the default respect_robots_txt

    # File 'lib/wayback_archiver.rb', line 187
    def self.respect_robots_txt
      @respect_robots_txt ||= DEFAULT_RESPECT_ROBOTS_TXT
    end
.respect_robots_txt=(respect_robots_txt) ⇒ Boolean
Sets the default respect_robots_txt

    # File 'lib/wayback_archiver.rb', line 181
    def self.respect_robots_txt=(respect_robots_txt)
      @respect_robots_txt = respect_robots_txt
    end
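The configuration getters on this page all use `||=` to fall back to a default. The sketch below, built on a hypothetical stand-in module (`DefaultsSketch`) with the default values from the Constant Summary, shows how that behaves: the default is reapplied whenever the stored value is `nil` or `false`. For `respect_robots_txt` this is harmless only because the default is itself `false`.

```ruby
# Hypothetical stand-in mirroring the ||= memoization used by the getters.
module DefaultsSketch
  DEFAULT_RESPECT_ROBOTS_TXT = false
  DEFAULT_CONCURRENCY = 1

  class << self
    attr_writer :respect_robots_txt, :concurrency

    def respect_robots_txt
      @respect_robots_txt ||= DEFAULT_RESPECT_ROBOTS_TXT
    end

    def concurrency
      @concurrency ||= DEFAULT_CONCURRENCY
    end
  end
end

p DefaultsSketch.respect_robots_txt   # => false (default)
DefaultsSketch.respect_robots_txt = true
p DefaultsSketch.respect_robots_txt   # => true

p DefaultsSketch.concurrency          # => 1 (default)
DefaultsSketch.concurrency = 5
p DefaultsSketch.concurrency          # => 5
DefaultsSketch.concurrency = nil
p DefaultsSketch.concurrency          # => 1 (nil falls back to the default)
```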
.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Get URLs from sitemap and send found URLs to the Wayback Machine.

    # File 'lib/wayback_archiver.rb', line 125
    def self.sitemap(url, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
      WaybackArchiver.logger.info "Fetching Sitemap"
      Archive.post(URLCollector.sitemap(url), concurrency: concurrency, limit: limit, &block)
    end
.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block) ⇒ Array<ArchiveResult>
Send one or more URLs to the Wayback Machine.

    # File 'lib/wayback_archiver.rb', line 140
    def self.urls(urls, concurrency: WaybackArchiver.concurrency, limit: WaybackArchiver.max_limit, &block)
      Archive.post(Array(urls), concurrency: concurrency, &block)
    end
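Because `.urls` wraps its argument in `Array(...)`, it accepts either a single URL string or an array of URLs. The short example below shows how Ruby's `Kernel#Array` normalizes each case; the URLs are placeholders.

```ruby
# Kernel#Array wraps a single value and leaves arrays unchanged.
p Array('https://example.com')
# => ["https://example.com"]

p Array(['https://example.com/a', 'https://example.com/b'])
# => ["https://example.com/a", "https://example.com/b"]

# nil becomes an empty array, so there is nothing to post.
p Array(nil)
# => []
```

In the listing above, `limit` is accepted by `.urls` but is not forwarded to `Archive.post`, so in this version it appears to have no effect for this method.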
.user_agent ⇒ String
Returns the configured user agent

    # File 'lib/wayback_archiver.rb', line 174
    def self.user_agent
      @user_agent ||= USER_AGENT
    end
.user_agent=(user_agent) ⇒ String
Sets the user agent

    # File 'lib/wayback_archiver.rb', line 168
    def self.user_agent=(user_agent)
      @user_agent = user_agent
    end