

Visualizing a universe of clustered tags.

Felipe Hoffa is a Developer Advocate for Google Cloud. In this post he works with BigQuery, Google’s serverless data warehouse, to run k-means clustering over Stack Overflow’s published dataset, which is refreshed and uploaded to Google’s Cloud once a quarter. You can check out more about working with Stack Overflow data and BigQuery here and here.

4,000+ tags are a lot

These are the most active Stack Overflow tags since 2018, and there are a lot of them. In this picture I only have 240 tags. How would you group and categorize 4,000+ of them?

    # Tags with >180 questions since 2018
    ...
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`,
    ...

Top Stack Overflow tags by number of questions.

Hint: Co-occurring tags

Let’s find tags that usually go together:

Co-occurring tags on Stack Overflow questions

So I’ll take these relationships and save them in an auxiliary table, plus a percentage of how frequently each relationship happens for each tag.

    CREATE OR REPLACE TABLE `deleting.stack_overflow_tag_co_ocurrence`
    ...
    FROM `fh-bigquery.stackoverflow_archive.201906_posts_questions`
    ...
    SELECT *, questions/questions_tag1 percent
    ...
    SELECT *, MAX(questions) OVER(PARTITION BY tag1) questions_tag1
    ...
    FROM data, UNNEST(SPLIT(tags, '|')) tag1, UNNEST(SPLIT(tags, '|')) tag2
    WHERE tag1 IN (SELECT tag FROM active_tags)
    AND tag2 IN (SELECT tag FROM active_tags)
    ...
    )

One-hot encoding

Now get ready for some SQL magic. BigQuery ML does a good job of one-hot encoding strings, but it doesn’t handle arrays as I wish it did (stay tuned). So I’m going to create a string first that will define all the columns where I want to find co-occurrence. Then I can use that string to get a huge table, with a 1 for every time a tag co-occurs with the main one at least a certain percentage of the time. Let’s first see a subset of these results:

What you see here is a co-occurrence matrix: each row is a tag, and each column holds a 1 when that tag co-occurs often enough with the tag the column is named after.

You can reduce or increase the sensitivity of these relations with the percent threshold:

    SELECT tag1
    ,IFNULL(ANY_VALUE(IF(tag2='javascript',1,null)),0) Xjavascript
    ,IFNULL(ANY_VALUE(IF(tag2='python',1,null)),0) Xpython
    ,IFNULL(ANY_VALUE(IF(tag2='android',1,null)),0) Xandroid
    ,IFNULL(ANY_VALUE(IF(tag2='jquery',1,null)),0) Xjquery
    FROM `deleting.stack_overflow_tag_co_ocurrence`
    ...
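To make the co-occurrence and one-hot steps easier to follow outside BigQuery, here is a minimal Python sketch of the same idea. It is not the SQL used in this post. The function names (`co_occurrence`, `with_percent`, `one_hot`), the six sample questions, and the 0.5 threshold are all made up for illustration, and it assumes a tag appears at most once per question.

```python
from collections import Counter
from itertools import product


def co_occurrence(tag_strings):
    """Count ordered tag pairs per question, including each tag with itself."""
    pairs = Counter()
    for tags in tag_strings:
        tag_list = tags.split("|")
        # Cross join of the tag list with itself, like unnesting tags twice.
        pairs.update(product(tag_list, repeat=2))
    return pairs


def with_percent(pairs):
    """Divide each pair count by the largest count seen for its first tag."""
    totals = {}
    for (tag1, _), n in pairs.items():
        # Assumption: tags are unique within a question, so the largest
        # count for tag1 is its self pair, i.e. the questions carrying tag1.
        totals[tag1] = max(totals.get(tag1, 0), n)
    return {(t1, t2): n / totals[t1] for (t1, t2), n in pairs.items()}


def one_hot(percents, columns, threshold):
    """One row per tag; 1 where a column tag co-occurs above the threshold."""
    rows = {}
    for (tag1, tag2), pct in percents.items():
        row = rows.setdefault(tag1, {c: 0 for c in columns})
        if tag2 in row and pct > threshold:
            row[tag2] = 1
    return rows


# Hypothetical sample data in the same pipe-separated format as the dataset.
questions = [
    "javascript|jquery|html",
    "javascript|jquery",
    "javascript|html",
    "python|pandas",
    "python|django",
    "java|android",
]

percents = with_percent(co_occurrence(questions))
matrix = one_hot(percents, ["javascript", "python", "android", "jquery"], 0.5)

for tag, row in sorted(matrix.items()):
    print(tag, row)
```

Raising the threshold zeroes out weaker relations and lowering it lets more of them through, which is the same knob as the percent threshold described in the post.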
