YaCy-Bugtracker - YaCy
View Issue Details
0000646  YaCy  [All Projects] General  public  2016-03-22 11:18  2016-09-01 15:54
YaCy 1.8 
0000646: Refactoring of some postprocessing procedures
It would be great to improve some of YaCy's postprocessing procedures, since they currently take a very long time to complete.
Postprocessing Progress
busy:postprocessed 8200 from 100556956 collection documents; 8 ppm; 12189368 minutes remaining
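For scale, the remaining-time estimate in the status line above can be converted to calendar time. This is only a small arithmetic sketch: the function name is made up for illustration, and the input is the figure copied from the status line.

```python
MINUTES_PER_DAY = 60 * 24
DAYS_PER_YEAR = 365.25  # average year length, an approximation


def minutes_to_years(minutes: int) -> float:
    """Convert a duration in minutes to (approximate) years."""
    return minutes / MINUTES_PER_DAY / DAYS_PER_YEAR


remaining_minutes = 12189368  # "minutes remaining" from the status line
years = minutes_to_years(remaining_minutes)
print(f"{years:.1f} years")  # prints "23.2 years"
```

At the reported rate, postprocessing the whole collection would take on the order of two decades.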
png postProcessProfiling.png (113,996) 2016-08-26 13:55
Issue History
2016-03-22 11:18  LA_FORGE  New Issue
2016-03-22 11:29  LA_FORGE  Note Added: 0001230
2016-08-26 13:55  luc  File Added: postProcessProfiling.png
2016-08-26 14:03  luc  Note Added: 0001287
2016-08-26 14:05  luc  Note Edited: 0001287
2016-08-28 20:06  LA_FORGE  Note Added: 0001288
2016-08-30 09:44  luc  Note Added: 0001291
2016-09-01 15:54  luc  Note Added: 0001292

2016-03-22 11:29   
The system load is very high during the process mentioned above:
load average: 10.54, 10.92, 10.72
These values represent a load of about 140% on my 8-core system.
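The percentage comes from dividing each load average by the number of cores. A minimal sketch of that calculation, with a helper name chosen only for illustration:

```python
def load_percent(load_average: float, cores: int) -> float:
    """Express a load average as a percentage of the available cores."""
    return 100.0 * load_average / cores


# 1-, 5- and 15-minute load averages reported above, on an 8-core machine
for load in (10.54, 10.92, 10.72):
    print(f"load {load}: {load_percent(load, 8):.1f}% of 8 cores")
```

For the three values above this gives roughly 130 to 140 percent, meaning more runnable tasks than the 8 cores can serve at once.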


2016-08-26 14:03   
(edited on: 2016-08-26 14:05)
Hi, I have attached a first profiling trace, recorded with VisualVM on a Debian Jessie machine while postprocessing 11240 recently crawled documents, with the webgraph Solr core disabled.

With this configuration, the two main hotspots consuming processor time are clearly identified and, unsurprisingly, both relate to Solr operations:
- in CollectionConfiguration.postprocessing_doublecontent: the internal parsing of the Solr query used to search for duplicate documents (ExtendedDismaxQParser.parse - 41%)
- in CollectionConfiguration.postprocessing: the partial update of each document (ConcurrentUpdateSolrConnector.update() - 34.1%)

2016-08-28 20:06   
Thank you very much for the detailed analysis. I'm using Debian Jessie here, too. It runs much more smoothly since upgrading to YaCy 1.9 and Java 8, with the additional Java args -XX:+UseParallelGC -XX:+UseNUMA in a multiprocessor environment. I'm sure the new Solr version is improved, too.
2016-08-30 09:44   
I continued experimenting a little bit to see how much this can be improved.
Just to check, I disabled the duplicate-document search and changed how updates are performed: I grouped them into batches of 100 and even 1000 documents to avoid committing on every update.
With my small test set (11240 documents) this roughly halves the processing time. No miracle, though: with this logic and millions of documents to process, it would still take far too long.
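The grouping idea can be sketched roughly as follows. This is not the YaCy or SolrJ code: `send_batch` is a hypothetical stand-in for "send these updates to Solr, then commit once", and the document IDs are just integers.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # last, possibly shorter, batch
        yield batch


def update_in_batches(docs: Iterable[T], size: int,
                      send_batch: Callable[[List[T]], None]) -> int:
    """Send updates batch by batch; return the number of commits made.

    `send_batch` is a hypothetical callback standing in for one
    update-and-commit round trip to the index.
    """
    commits = 0
    for batch in batches(docs, size):
        send_batch(batch)  # one commit per batch instead of one per document
        commits += 1
    return commits


sent: List[List[int]] = []
commits = update_in_batches(range(11240), 1000, sent.append)
print(commits)  # prints 12 (11 full batches of 1000 plus one of 240)
```

With batches of 1000, the 11240 test documents need 12 commits instead of 11240.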
I wonder whether this processing could be performed at another, more appropriate place or time...
2016-09-01 15:54   
Unfortunately, more extended profiling revealed that grouping the update commits is ineffective... The real CPU-burning hotspot is the postprocessing_doublecontent query.
With my current knowledge, I do not know how to improve this drastically...

By the way, for now I have committed some refactorings that at least make the whole postprocessing algorithm easier to understand and to profile, for example with VisualVM.
See pull request: https://github.com/yacy/yacy_search_server/pull/71