Aggregating with data.table in R
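The exercise is to aggregate a numeric vector of values by a combination of factors, using data.table in R. Take the following data table as an example:

require(data.table)
require(plyr)
dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27)))

Note that each unique combination of 'month' and 'fac' appears three times. So when I average the values by these two factors, I should expect a data frame with 9 unique rows:

(agg1 <- ddply(dtb, c("month", "fac"), function(dfr) mean(dfr$value)))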
month fac V1
1 Jan a -0.36030953
2 Jan b -0.58444588
3 Jan c -0.15472876
4 Feb a -0.05674483
5 Feb b 0.26415972
6 Feb c -1.62346772
7 Mar a 0.24560510
8 Mar b 0.82548140
9 Mar c 0.18721114
However, when aggregating with data.table, I keep getting a result row for every redundant combination of the two factors:

(agg2 <- dtb[, value := mean(value), by = list(month, fac)])
month fac value
1: Jan a -0.36030953
2: Jan a -0.36030953
3: Jan a -0.36030953
4: Feb a -0.05674483
5: Feb a -0.05674483
6: Feb a -0.05674483
7: Mar a 0.24560510
8: Mar a 0.24560510
9: Mar a 0.24560510
10: Jan b -0.58444588
11: Jan b -0.58444588
12: Jan b -0.58444588
13: Feb b 0.26415972
14: Feb b 0.26415972
15: Feb b 0.26415972
16: Mar b 0.82548140
17: Mar b 0.82548140
18: Mar b 0.82548140
19: Jan c -0.15472876
20: Jan c -0.15472876
21: Jan c -0.15472876
22: Feb c -1.62346772
23: Feb c -1.62346772
24: Feb c -1.62346772
25: Mar c 0.18721114
26: Mar c 0.18721114
27: Mar c 0.18721114
month fac value
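Is there an elegant way to collapse these results into one row per unique combination of factors with data.table?

The issue (and the reasoning) has to do with the fact that you are assigning the aggregated values, not just computing them. This is easier to see if you look at a data.table that has more columns than just those used in the computation.

# Therefore, let's add a new column
dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we only want to output the computed values, then the expression on the RHS is just fine.

# This gives the expected results
dtb[, mean(value), by = list(month, fac)]
# This on the other hand assigns the respective values to *each* row
dtb[, value := mean(value), by = list(month, fac)]

In other words, without the assignment the data is subset so that only the unique values are returned. Copying the assigned data.table to agg2, however, still sends all the rows through.

So if you want to copy only the unique rows of the original table to a new table, you can either

a. wrap the original table inside `unique()` before assigning it
b. assign the table, above, that is returned when you are not assigning the RHS output (which is what @Arun suggested)

An example of a. would be:

agg2 <- unique(dtb[, value := mean(value), by = list(month, fac)])

The following example might help illustrate. (You will need to copy and paste it, as the output is omitted.)

# SAMPLE DATA, as above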
library(data.table)
dtb.bak <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27)))
# METHOD 1 #
#------------#
dtb <- copy(dtb.bak) # restore, from sample data.
dtb[, value := mean(value), by = list(month, fac)]
dtb
# this is what you would like to assign
unique(dtb)
# METHOD 2 #
#------------#
dtb <- copy(dtb.bak) # restore, from sample data.
# this is what you would like to assign
# next two lines are the same, only difference is column name
dtb[, mean(value), by = list(month, fac)]
dtb[, list("mean" = mean(value)), by = list(month, fac)] # quote marks added for clarity
# dtb is unchanged.
dtb
# NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
dtb.bak[, newCol := rep(c("A", "B", "A"), length(value) / 3)]
dtb1 <- copy(dtb.bak) # restore, from sample data.
dtb2 <- copy(dtb.bak) # restore, from sample data.
# Method 1
dtb1[, value := mean(value), by = list(month, fac)]
dtb1
unique(dtb1)
# METHOD 2 #
dtb2[, list("mean" = mean(value)), by = list(month, fac)] # quote marks added for clarity
dtb2
# METHOD 2, WITH ADDED COLUMNS IN list() in `j`
dtb2[, list("mean" = mean(value), newCol), by = list(month, fac)] # quote marks added for clarity
# notice this has more columns than
unique(dtb1)
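
For a quick check of the difference between the two methods, here is a minimal sketch that counts rows instead of printing tables. The fixed seed, the names `agg`, `dtb_assigned` and `agg_unique`, and the use of the `by` argument of `unique()` are assumptions added for illustration; they are not part of the answer above.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so repeated runs give the same values
dtb <- data.table(cbind(
  expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]),
  value = rnorm(27)
))

# Computing in `j` without `:=` returns a new table with one row per group.
agg <- dtb[, list(mean = mean(value)), by = list(month, fac)]

# Assigning with `:=` updates the (copied) table in place and keeps every row.
dtb_assigned <- copy(dtb)[, value := mean(value), by = list(month, fac)]

# Deduplicating on the grouping columns leaves one row per group again.
agg_unique <- unique(dtb_assigned, by = c("month", "fac"))

nrow(agg)
nrow(dtb_assigned)
nrow(agg_unique)
```

The three row counts should come out as 9, 27 and 9. Restricting `unique()` to the grouping columns is one way to keep the deduplication from depending on extra columns such as `newCol`, although it then keeps only the first `newCol` value seen in each group.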
