YaCy-Bugtracker - YaCy
View Issue Details
ID: 0000717 | Project: YaCy | Category: [All Projects] General | View Status: public | Date Submitted: 2017-01-05 21:26 | Last Update: 2017-01-06 03:04
0000717: Index document with wrong field content for metadata from html tags
After parsing/crawling several documents, some index documents contain wrong content, e.g. in h1_txt and in other text extracted from HTML tags (e.g. underline_txt or the image_* index fields).

Example: the h1_txt content below is not part of the page at all:
<str name="id">oqcOSGHd8iIa</str>
<str name="sku">http://worldbuilding.stackexchange.com/questions/66895/ultimate-australian-canal</str>

<arr name="title">
<str>climate - Ultimate Australian Canal - Worldbuilding Stack Exchange</str>
</arr>

<arr name="h1_txt">
<str>Spendenaufruf : Wikipedia sammelt 8,7 Millionen Euro</str>
</arr>
<int name="h1_i">1</int>
Debug Info:
Tag/field content comes from the scraper.
The scraper that was used is remembered in the htmlParser.

However, the parser is reused for several documents, and its remembered scraper is always the one for the current document, while the indexing process might still be working on an earlier document.

In this concurrency situation, yacy2solr gets the earlier document but the current scraperObject.
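
The mix-up described here can be reproduced in a small, deterministic sketch. None of the class or method names below are YaCy's real API: `SharedParser`, `Scraper`, `Document`, `indexViaParser` and `indexViaDocument` are hypothetical stand-ins, and the two parse calls simply simulate the interleaving without real threads. Attaching the scraper to the document is shown as one possible way to avoid the stale read, not necessarily what the actual fix does.

```java
public class ScraperRaceSketch {

    /** Hypothetical stand-in for the per-page tag extract (h1, underline, images, ...). */
    static final class Scraper {
        final String h1;
        Scraper(String h1) { this.h1 = h1; }
    }

    /** Hypothetical parsed document that carries the scraper it was built from. */
    static final class Document {
        final String url;
        final Scraper scraper;
        Document(String url, Scraper scraper) { this.url = url; this.scraper = scraper; }
    }

    /** Problematic pattern: one reused parser that remembers only the latest scraper. */
    static final class SharedParser {
        private Scraper lastScraper; // overwritten by every parse() call

        Document parse(String url, String h1) {
            lastScraper = new Scraper(h1);
            return new Document(url, lastScraper);
        }

        Scraper getLastScraper() { return lastScraper; }
    }

    /** Builds an index entry by asking the shared parser for its scraper (stale-prone). */
    static String indexViaParser(Document doc, SharedParser parser) {
        return doc.url + " -> h1_txt=" + parser.getLastScraper().h1;
    }

    /** Builds an index entry from the scraper attached to the document itself. */
    static String indexViaDocument(Document doc) {
        return doc.url + " -> h1_txt=" + doc.scraper.h1;
    }

    public static void main(String[] args) {
        SharedParser parser = new SharedParser();
        Document first = parser.parse("http://example.org/a", "Heading A");
        // Before 'first' is indexed, the same parser is reused for another page.
        parser.parse("http://example.org/b", "Heading B");

        // Page a ends up with the heading of page b.
        System.out.println(indexViaParser(first, parser));
        // Page a keeps its own heading.
        System.out.println(indexViaDocument(first));
    }
}
```

In this sketch the first line printed pairs page a with "Heading B", which mirrors the Wikipedia headline showing up on the Stack Exchange page, while the second line keeps "Heading A" because the lookup no longer depends on shared parser state.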
No tags attached.
Issue History
2017-01-05 21:26 | BuBu | New Issue
2017-01-05 21:26 | BuBu | Status | new => assigned
2017-01-05 21:26 | BuBu | Assigned To | => BuBu
2017-01-06 03:04 | BuBu | Note Added: 0001370
2017-01-06 03:04 | BuBu | Status | assigned => resolved
2017-01-06 03:04 | BuBu | Resolution | open => fixed

2017-01-06 03:04
See commit https://github.com/yacy/yacy_search_server/commit/4c9be29a55b51d9937137806ed4f248875c32a2b