|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000518||YaCy||[All Projects] General||public||2014-12-27 00:25||2016-03-02 01:22|
|Product Version||YaCy 1.8|
|Target Version||Fixed in Version|
|Summary||0000518: YaCy does not appear to respect Crawl-delay:|
|Description||From testing on my own sites it appears that YaCy still fetches pages a lot faster than 1 page per 10 seconds even if robots.txt says "Crawl-delay: 10".|
YaCy really should respect Crawl-delay. I believe the crawler's failure to respect this, and the utterly bad way it behaves in general, is why some sites choose to deny YaCy in robots.txt, like ZH: http://www.zerohedge.com/robots.txt
I have also noticed that some sites just block YaCy by IP due to too many requests per minute. It is bad when your IP gets blocked on sites you usually visit regularly (which is a reason one may want to crawl them).
/CrawlStartSite.html actually says " No more that two pages are loaded from the same host in one second (not more that 120 document per minute) to limit the load on the target server." and this is, in my opinion, a very bad design decision. If 10 spiders do this (and there are at minimum 10 spiders crawling my more popular sites at all times) then that's a whole lot of load on a dynamically generated site.
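To make the numbers concrete, here is a rough sketch (Python, not YaCy's actual Java code) of what honouring the directive could look like. It uses the standard library's `urllib.robotparser`; the helper name `min_fetch_interval`, the user agent token `yacybot`, and the 0.5 second default (taken from the "two pages per second" limit quoted above) are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt as described in this report.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""


def min_fetch_interval(robots_txt: str, user_agent: str,
                       default_delay: float = 0.5) -> float:
    """Return the minimum number of seconds to wait between two fetches.

    default_delay is a hypothetical crawler-side limit (0.5 s, i.e. two
    pages per second); a larger Crawl-delay in robots.txt overrides it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    if delay is None:
        return default_delay
    return max(default_delay, float(delay))


interval = min_fetch_interval(ROBOTS_TXT, "yacybot")
print(f"wait {interval} s between fetches, {60 / interval:.0f} pages/minute")

# Load if 10 independent crawlers each use the 120 pages/minute ceiling.
print(f"10 crawlers at the default limit: {10 * 120} pages/minute")
```

With "Crawl-delay: 10" a single crawler would be held to 6 pages per minute, compared with up to 1200 pages per minute if 10 crawlers each run at the default ceiling.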
|Tags||No tags attached.|
IIRC, the access delay is cumulative for all crawlers.
About changing the user agent string, I've opened the wishlist ticket 0000579.
Fixed fetching robots.txt for sites using a non-standard port (other than 80 / 443).
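For illustration only, the robots.txt location for a host with an explicit port could be derived as sketched below. This is Python rather than YaCy's Java and is not the actual change that was committed; `robots_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Build the robots.txt URL for the host (and port) serving page_url."""
    parts = urlsplit(page_url)
    # netloc keeps any explicit port, e.g. "example.org:8090", so the
    # lookup goes to the same server as the page itself.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://example.org:8090/some/page.html?x=1"))
```

If the port were dropped, the crawler would read the robots.txt of whatever runs on port 80 / 443 of that host, or find none at all, and any Crawl-delay set for the non-standard port would be missed.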
|2014-12-27 00:25||oyvinds||New Issue|
|2015-05-23 01:58||Davide||Note Added: 0001052|
|2016-03-02 01:22||BuBu||Note Added: 0001223|
|2016-03-02 01:22||BuBu||Status||new => resolved|
|2016-03-02 01:22||BuBu||Resolution||open => fixed|
|2016-03-02 01:22||BuBu||Assigned To||=> BuBu|