|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000738||YaCy||[All Projects] General||public||2017-04-22 02:38||2019-07-28 08:38|
|Assigned To|| |
|Product Version||YaCy 1.9|| |
|Target Version||Fixed in Version|| |
|Summary||0000738: YaCy stuck in noindex,follow wasteland|
|Description||I've seen this happen many times: the crawler starts crawling a set of forum, search-result, or archive pages (the kind of pages with little or duplicate content that search engines usually should not index). SEOs commonly meta-tag such pages as "noindex,follow" to let PageRank flow through the site while keeping the pages from spamming the Google index.|
YaCy takes the instruction (noindex,follow) at face value, but has no way to deal with it properly. As a result, YaCy will (in certain situations) crawl such pages for hours without indexing anything.
|Additional Information||Same issue:|
In my case it's a scheduled job. It's limited to 100 pages per seed URL, but that limit doesn't seem to be applied to scheduled jobs.
A solution would be to disable crawling of "noindex,follow" pages altogether. Following them makes little sense for YaCy anyway. A reasonable assumption is that important content is always linked from indexable pages and never reachable only through noindex pages.
|Tags||No tags attached.|
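
The proposal under Additional Information could be sketched roughly as follows. This is only an illustration in Python, not YaCy code: the names `RobotsMetaParser`, `crawl_decision` and the `skip_noindex_follow` switch are hypothetical, and the sketch assumes the directive arrives in a standard `<meta name="robots">` tag. With the switch on, a page tagged "noindex,follow" is neither indexed nor used as a source of further links.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def crawl_decision(html, skip_noindex_follow=True):
    """Return (index_page, follow_links) for one fetched page.

    skip_noindex_follow is a hypothetical switch: when True, links on
    noindex pages are not followed, as proposed in this issue.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = parser.directives

    # "none" is shorthand for "noindex,nofollow".
    noindex = "noindex" in directives or "none" in directives
    nofollow = "nofollow" in directives or "none" in directives

    index_page = not noindex
    follow_links = not nofollow
    if skip_noindex_follow and noindex:
        # Proposed behaviour: do not wander through unindexable pages.
        follow_links = False
    return index_page, follow_links


page = ('<html><head><meta name="robots" content="noindex,follow"></head>'
        '<body><a href="/page2">next</a></body></html>')
print(crawl_decision(page))                             # proposed behaviour
print(crawl_decision(page, skip_noindex_follow=False))  # behaviour described above
```

Keeping the behaviour behind a switch would let users who do want link discovery through noindex pages keep it, at the cost of the long unproductive crawls described in this issue.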