ID: 0000774
Project: YaCy
Category: Wishlist - Wunschliste
View Status: public
Date Submitted: 2017-12-27 16:32
Last Update: 2017-12-27 16:35
Assigned To:
Platform:
OS:
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0000774: [Feature request] Add crawler filter based on document language
Description: Currently the crawler has a crude localization filter based on the country code of the host server's IP address. An improvement would be to also offer a filter based on the actual language detected from the document content.

_ Main pros:
It would be easier to set up a large crawler job aimed at one specific language and spanning a wide array of domain names and subdomains, without having to specify individual regex filter rules for each of those domains, since domain names only occasionally contain a country code. Also consider that the crawler may be allowed to extend to domain names or hosts not initially listed in the crawler job specification, which makes such rules even harder to specify.

The current filter based on country code may also be inadequate as a single host may offer documents in multiple languages.

_ Main cons:
A simple DNS lookup wouldn't be enough to evaluate the filter rules, as the document needs to be downloaded and parsed before it can be passed through the filter.
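
To illustrate the idea, here is a minimal sketch of such a filter, applied after the document has been fetched and parsed. This is not YaCy code: the names `STOPWORDS`, `detect_language` and `passes_language_filter` are hypothetical, and the stopword counting is only a stand-in for whatever language detector the crawler would actually use.

```python
import re

# Hypothetical, tiny stopword lists used only for illustration;
# a real detector would rely on a trained language model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "les", "et", "est", "une", "des", "pas"},
}


def detect_language(text):
    """Return the language code whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def passes_language_filter(parsed_text, allowed_languages):
    """Decide, after download and parsing, whether the document is kept."""
    return detect_language(parsed_text) in allowed_languages


# Two documents on the same (made-up) host, in different languages.
docs = {
    "http://example.org/en/about": "The project is open to everyone and it is free",
    "http://example.org/de/about": "Das Projekt ist offen und die Nutzung ist nicht beschränkt",
}
accepted = [url for url, text in docs.items() if passes_language_filter(text, {"de"})]
```

In this sketch both documents come from one host, so a filter on the host's country code would treat them identically, while the content-based filter keeps only the German page. The cost mentioned above is visible too: the filter needs the parsed text, so it can only run after the download.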
Tags: No tags attached.
-  Notes
There are no notes attached to this issue.

- Issue History
Date Modified Username Field Change
2017-12-27 16:32 Davide New Issue
