Short report expert workshop 'Using web crawling data in identifying new jobs and new skills'

Session 1

The first session started with Heidemarie Müller-Riedlhuber (WIAB), who mainly highlighted the challenges in the comparability of skills and competences and mentioned that the latter should be used as the common denominator for occupation and qualification descriptions in labour markets and education/training systems. To this end, there have been several initiatives to make skills and competences more comparable at the European level, such as the European Qualifications Framework (EQF), the European Credit Transfer and Accumulation System (ECTS), the European Dictionary of Skills and Competences (DISCO I, II), and European Skills, Competences, Qualifications and Occupations (ESCO). However, the transparency and comparability challenges remain, given the difficulty of maintaining a complex, overarching, and multilingual classification of occupations, qualifications, skills and competences. Moreover, new occupations continuously enter the labour market while old ones exit, so classifications have to be kept up to date in a multilingual environment. Similarly, the formulation and translation of occupation titles, qualifications, and skills/competences can involve ambiguous terms and language differences, which also demands substantial effort in maintaining multilingual taxonomies. She reckons that intelligent software and internet data can help overcome some of these challenges by supporting the maintenance and translation of occupation, qualification, and skills/competences classifications, as well as the search for new qualifications by looking at the quantity of usage, source, and context at both national and international levels. Nevertheless, web crawling/text mining methods are not immune to problems either. She listed the properties that crawling methods should have in order to provide reliable data: web crawlers need a sufficient amount of data to provide statistically relevant information; they need to select and recognise the relevant content, qualification titles, and skills/competences at the right sources (relevant URLs); and they need to interpret and compare them accurately.

Participants shared views on the challenges of comparability of skills and competences and agreed on the need for machine-based/automated methods that evaluate the whole context when coding job ads, bearing in mind issues of language and skill expressions that diverge across countries. With regard to the (in)comparability of occupations, Kea Tijdens (University of Amsterdam) mentioned another project in which she consulted experts and analysed more than 100 occupations in 8 countries; no similarity was found for the majority of the occupations.

Andras Levai (Szechenyi Istvan University) shared lessons learned from his experience in developing web crawling methods that extract skills and job-related information from the texts of job vacancies. He acknowledged that web crawling is not an easy task, since it requires unique code for each vacancy site and occasionally runs into vacancy sites that prevent crawlers from grabbing information (for example, LinkedIn). Moreover, the large number of job titles in use is difficult to handle and necessitates proper occupational coding and classification; there is ongoing research on how far the ESCO classification can help with that. Finally, he mentioned a very interesting dimension related to the legal framework of web crawling methods.

The question of the legal framework of web crawling methods stimulated further discussion among the participants. The legality question arises from the fact that mainly the vacancies do not belong to the job portals, but rather to the public or private portals. On the one hand, the issue is not straightforward at first glance, yet the collection of internet data is massive and generally seems to be allowed. On the other hand, the legal limits might be country-specific and follow a certain set of privacy and security guidelines; for example, the OECD provides such data privacy guidelines. However, given the extent of academic and policy discussion on skills gaps and mismatch in the labour markets, there seemed to be a consensus among participants that there is an overall public interest in collecting and analysing vacancy data, even if it involves crawling/grabbing methods applied to online job boards.

The last speaker of the session, Hanne Shapiro (Danish Technological Institute), started her talk by emphasising the importance of labour market information in general. She stressed that the real-time labour market information provided by job portals complements the traditional survey-based information when analysing current labour market conditions and seeking snapshot insights into skill gaps and supply/demand gaps or mismatches. At the same time, most job postings on the internet are biased toward industries and occupations that seek high-skilled white-collar workers; for instance, Georgetown University estimates that 60-70% of new job openings in the US are posted online and target workers with a college degree or higher. Similar to the first presentation, Hanne also highlighted the lack of standardisation of job ads posted online: employers do not necessarily pay attention to format and language, which can distort the real-time information. She continued with the main shortcomings of real-time labour market information: not all job openings are posted online (so the data give only part of the picture), and only a few online job ads contain complete information on desired skills, competences, and personal attributes that crawling methods could grab consistently. Furthermore, the same job ad may be duplicated on several portals, which distorts the overall data quality.
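
The duplication problem can be illustrated with a minimal Python sketch. It assumes that duplicates differ only in case, punctuation, and spacing, which is an optimistic simplification; the portal names, ad texts, and function names are hypothetical and do not describe any method presented at the workshop.

```python
import hashlib
import re


def fingerprint(ad_text: str) -> str:
    """Return a hash of the ad text after crude normalisation."""
    # Lowercase, replace punctuation with spaces, and collapse whitespace so
    # that trivial formatting differences between portals do not hide a duplicate.
    normalised = re.sub(r"[^\w\s]", " ", ad_text.lower())
    normalised = " ".join(normalised.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def deduplicate(ads):
    """Keep the first occurrence of each distinct (normalised) ad text."""
    seen = set()
    unique = []
    for ad in ads:
        key = fingerprint(ad["text"])
        if key not in seen:
            seen.add(key)
            unique.append(ad)
    return unique


# Hypothetical crawled ads: the first two are the same ad on two portals.
ads = [
    {"portal": "portal-a", "text": "Data Analyst - SQL, Excel. Apply now!"},
    {"portal": "portal-b", "text": "data analyst   sql, excel; apply now"},
    {"portal": "portal-a", "text": "Warehouse operative, night shifts"},
]
unique_ads = deduplicate(ads)
print(len(unique_ads))
```

Exact hashing of normalised text only catches near-verbatim copies; ads that are reworded between portals would need a fuzzier comparison.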

Discussion continued among the participants, who acknowledged the richness of real-time data despite its shortcomings and recommended a multivariate approach that looks beyond job titles to tasks, industry, skills and so on.

Session 2

Miroslav Beblavy (CEPS) presented a paper on job ads and the demand for low-skilled individuals in the Slovak labour market using Profesia’s data. Discussants again raised the representation issues of vacancy data and agreed that formal vacancies are quite far from the real job rotation in the labour market, since many people get hired through networks or go through an internal recruitment process. Hence, online vacancy data capture only a part of the labour market dynamics, not all of them.

Next, Ildiko Szabo (Corvinus University) presented the SMART project, which investigates the compliance between educational offers and job market expectations in the Andalusia region of Spain. The project involved a Java web crawler processing online job offers.

Lastly, Matt Sigelman (Burning Glass Technologies) presented his experience in providing and working with real-time labour market data. Regardless of the skill gaps in the labour market, he emphasised that there is first of all an information gap between employers and job seekers. The information gap is reinforced by the absence of a common language in the workforce, which makes comparisons and matches difficult. To this end, the ability to aggregate labour market signals at different levels is all the more important, and his experience suggests that skill-level analysis has been the most fruitful in that sense, whereby the goal is to understand the skill sets required by employers and those available in the labour market. Therefore, for accurate labour market analysis with internet data, data crawling/text mining algorithms must know which skills to capture and which to ignore when grabbing online vacancies, and they need to be context-sensitive when coding them. This makes the crawled job market data accessible and analysable not only to job seekers but also to researchers working with real-time market data. For example, standardising skill variants and mapping skill relationships across clusters facilitates aggregate information search, gives useful information on current and emerging trends, and helps track hard-to-fill occupations in the labour markets. Such information would not only guide job seekers but could also be used to guide university and training programmes on the emerging skill patterns demanded in the market. Finally, using predictive tools and further analysis of such data, lessons could be drawn that help align supply and demand in order to improve employment and/or reduce long-term unemployment risks in the future.
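
As an illustration of what standardising skill variants can mean in practice, the following Python sketch maps different surface forms of a skill to one canonical label and then counts demand per canonical skill. The variant table, the skill lists, and the function names are hypothetical; this is a sketch under stated assumptions, not the approach used by any of the speakers.

```python
from collections import Counter

# Hypothetical variant table: each surface form maps to one canonical skill.
SKILL_VARIANTS = {
    "ms excel": "Microsoft Excel",
    "excel": "Microsoft Excel",
    "microsoft excel": "Microsoft Excel",
    "sql": "SQL",
    "structured query language": "SQL",
    "data analysis": "Data analysis",
    "data analytics": "Data analysis",
}


def standardise(skill):
    """Map a raw skill string to its canonical label, or None if unknown."""
    key = " ".join(skill.lower().split())
    return SKILL_VARIANTS.get(key)


def skill_demand(ads):
    """Count how many ads mention each canonical skill (at most once per ad)."""
    counts = Counter()
    for raw_skills in ads:
        canonical = {standardise(s) for s in raw_skills}
        canonical.discard(None)  # ignore skills that are not in the table
        counts.update(canonical)
    return counts


# Hypothetical skill strings extracted from three job ads.
ads = [
    ["MS Excel", "SQL"],
    ["excel", "Data Analytics", "Microsoft Excel"],
    ["Structured Query Language", "forklift licence"],
]
demand = skill_demand(ads)
print(demand.most_common())
```

Unknown strings such as "forklift licence" are simply dropped here; in a real pipeline they would be candidates for review, since they may be new skills that the table does not yet cover.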

Participants discussed the role of foundational/soft skills in emerging jobs and questioned how important they are compared to technical skills. The data suggest that while both types of skills are important, the exact link between the two depends on the job and the context. At the same time, evidence shows that even though many jobs officially require a university diploma, students tend to acquire an additional set of foundational/soft skills to complement their skill sets, given how these are more and more valued in the market. What is more, labour markets continue to evolve, and recent market data also show that many jobs that were previously known to place less emphasis on technical skills and more on soft skills now require STEM skills. A good example is marketing jobs, which require more and more statistical and data analysis skills. Finally, participants questioned whether it is possible to know if a vacancy has been filled using real-time labour market data. Unfortunately, such follow-up is not possible; what is possible, however, is to see how long a vacancy stays on the job portal and whether the same vacancy has been reposted, which in turn can give an idea of how easy this particular vacancy is to fill.
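
The two signals mentioned above, time spent online and reposting, can be computed along the following lines. This Python sketch assumes that the crawler records, for each vacancy, one (first seen, last seen) pair per posting spell; the identifiers, dates, and field names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def posting_stats(observations):
    """Summarise, per vacancy, total days online and number of reposts.

    `observations` is an iterable of (vacancy_id, first_seen, last_seen)
    tuples, one per posting spell of the same vacancy.
    """
    spells = defaultdict(list)
    for vacancy_id, first_seen, last_seen in observations:
        spells[vacancy_id].append((first_seen, last_seen))

    stats = {}
    for vacancy_id, periods in spells.items():
        # Count both the first and the last day of each spell.
        days_online = sum((end - start).days + 1 for start, end in periods)
        stats[vacancy_id] = {
            "days_online": days_online,
            "reposts": len(periods) - 1,
        }
    return stats


# Hypothetical crawl log: vacancy V1 was taken down and posted again.
observations = [
    ("V1", date(2016, 1, 4), date(2016, 1, 17)),
    ("V1", date(2016, 2, 1), date(2016, 2, 14)),
    ("V2", date(2016, 1, 4), date(2016, 1, 8)),
]
stats = posting_stats(observations)
print(stats)
```

A long time online or repeated reposting is only a proxy for a hard-to-fill vacancy: an ad may also disappear because the budget ran out or the position was withdrawn.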

Session 3

In the last session, Brian Fabo (CELSI) mainly talked about the voluntary web-based survey WageIndicator, which collects information on salaries, occupations, and tasks across a set of countries. He mentioned that semantic matching algorithms are generally used to analyse occupation titles, which can be made compatible with international classifications such as ISCO and ESCO. However, as discussed earlier, working with occupational titles across countries has many limitations for various reasons: increased fragmentation of the labour market, limited comparability of the same occupation title across industries, and so on. Therefore, they take an alternative approach and focus on what people actually do in their jobs rather than on their job titles. This implies a competence- or task-based approach, which is more comparable and easier to analyse than occupational titles. The WageIndicator survey could be used for that purpose. However, given the voluntary nature of this type of web-based survey, self-selection and representation issues arise again with this data source.

Next, Jakub Zavrel (Textkernel) shared his experience of crawling, parsing, and coding online job ads to enable text mining for skills in the European labour markets. He also acknowledged that working with European labour market data involves further difficulties due to its multilingual structure. Their work mainly involves searching and analysing real-time online job ads as well as historical data, providing data for text mining, and producing ‘jobfeeds’, which allow users to learn about opportunities on the job market in real time. They also face difficulties such as ambiguity in the use of context and job title distributions with a very long tail. Their method relies on automated coding of occupations based on existing taxonomies and on semantic matching tools that organise large and unstructured online job data to make it analysable. The main advantage of the semantic matching method is that one can always go back and consult the raw data (i.e. the vacancy text). Based on the analysis of job ad data, he re-emphasised the importance of guiding students - while they are still in college - about the skills that are important and in demand on the job market.
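
To make the idea of coding free-text job titles against an existing taxonomy more concrete, here is a deliberately simple Python sketch. It uses plain string similarity from the standard library's difflib as a stand-in for semantic matching, so it is not the method described by the speaker; the taxonomy labels and codes are hypothetical. The raw title is kept alongside the assigned code so that the original text can always be consulted.

```python
import difflib

# Hypothetical mini-taxonomy: occupation label -> made-up occupation code.
TAXONOMY = {
    "software developer": "OCC-001",
    "marketing specialist": "OCC-002",
    "shop sales assistant": "OCC-003",
}


def code_title(raw_title, cutoff=0.6):
    """Assign the closest taxonomy label to a raw job title, if any.

    Returns a dict that keeps the raw title next to the matched label and
    code, or None when no label is similar enough.
    """
    cleaned = " ".join(raw_title.lower().split())
    matches = difflib.get_close_matches(
        cleaned, list(TAXONOMY), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    label = matches[0]
    return {"raw_title": raw_title, "label": label, "code": TAXONOMY[label]}


coded = code_title("Software Developper")  # misspelt title from a job ad
uncoded = code_title("Astronaut")          # nothing similar in the taxonomy
print(coded, uncoded)
```

String similarity handles typos and small wording differences but not synonyms or translations, which is where the long tail of job titles and the multilingual setting discussed above become difficult.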

As the last speaker, Kea Tijdens (University of Amsterdam) initially focused on the challenge of multi-country occupational coding. While this challenge had been widely discussed during the workshop, Kea gave an interesting global picture of the problems involved in cross-country occupational coding, where the width and depth of job titles vary depending on the actors, i.e. job seekers, employers, professional associations, educational institutes, etc. She then focused on the match between the educational requirements in vacancies and the education levels of job holders, based on a paper using matched EURES (vacancies) and WageIndicator (jobholder information) data for the Czech Republic. Their analysis contributes to the limited knowledge about how employers adjust their vacancy behaviour in case of a low or abundant supply of jobseekers within a certain skill group.

At the end of the presentations, Miroslav Beblavy (CEPS) shared his thoughts on the take-home messages and concluding remarks of the workshop. Accordingly, the main advantages of internet-based job ad data are (1) their real-time nature and immediacy, (2) their usually large samples, and (3) their role as an imperfect mirror of labour market dynamics. They can give an idea of the skills and competences in demand by employers - and thus also help guide education policies and career paths - as well as convey the strategic communication (e.g. brand building, firm growth) of the hiring firm. The main weaknesses relate to (1) the lack of comparability of occupation and job data across countries and over time, and (2) data quality and reliability, due to representation concerns, duplications, and diverse coding practices. All these shortcomings seem to make online data not a perfect substitute for traditional labour market data, but rather a complementary source of information.


More information can be found here.


  • Last modified 11-01-2017