Aggregating with data.table in R

Published: 2020-05-23 00:01:39 | Category: Programming | Source: Internet

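The exercise is to aggregate a numeric vector of values by a combination of factors, using data.table in R. Take the following data table as an example: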

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3],each = 3),fac = letters[1:3]),value = rnorm (27)))

Note that each unique combination of 'month' and 'fac' appears three times. So when I average the values by these two factors, I expect a data frame with 9 unique rows:

(agg1 <- ddply (dtb,c ("month","fac"),function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when I aggregate with data.table, I keep getting a result for every redundant combination of the two factors:

(agg2 <- dtb[,value := mean (value),by = list (month,fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way to collapse these results into one row per unique factor combination with data.table?

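The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just computed.

This is easier to observe if you look at a data.table that has more columns than just the ones used in the calculation.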

# Therefore, let's add a new column
dtb[, newCol := LETTERS[seq(length(value))]]

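Notice that the expression on the RHS (the right-hand side of :=) is fine on its own if all we want is to output the computed value.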

# This gives the expected results
dtb[, mean(value), by = list(month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean(value), by = list(month, fac)]
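
To make the contrast concrete, here is a minimal sketch that runs both forms on freshly generated sample data and compares the row counts. The names `computed` and `assigned` and the seed are illustrative assumptions, and `copy()` is used only so that the assignment does not alter the table used for the first call.

```r
library(data.table)

set.seed(1)  # assumption: any seed works, it only makes the run repeatable
dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3),
                                    fac = letters[1:3]),
                        value = rnorm(27)))

# Computing only: j returns one row per (month, fac) group
computed <- dtb[, mean(value), by = list(month, fac)]

# Assigning: := writes the group mean into every row of the group,
# so the result keeps all rows of the table it was applied to
assigned <- copy(dtb)[, value := mean(value), by = list(month, fac)]

nrow(computed)  # one row per group
nrow(assigned)  # same number of rows as dtb
```

The unnamed expression in `j` gets the default column name `V1`, which matches the column name in the ddply output shown earlier.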

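In other words, the calculation returns only one value per group.
However, if you save this value back into the SAME data.table (which is what happens when you use the := operator),
then every row identified in i (by default, all rows) is assigned a value. (This makes sense when you look at the output with the additional column.)

Copying this data.table to agg2 afterwards still carries over all of the rows.

Therefore, if you want the new table to contain only the unique rows of the original table, you can either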

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table, above, that is returned when you
    are not assigning the RHS output (which is what @Arun suggested)

An example of a. would be:

agg2 <- unique(dtb[, value := mean(value), by = list(month, fac)])
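
One caveat with option a., sketched below under the assumption of a recent data.table release: `unique()` on an unkeyed table compares all columns, so an extra column that varies within a group stops the rows from collapsing. Passing the grouping columns through the `by` argument of `unique()` restricts the comparison to those columns. The names `all_cols` and `key_cols` are illustrative.

```r
library(data.table)

set.seed(1)  # assumption: any seed works
dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3),
                                    fac = letters[1:3]),
                        value = rnorm(27)))

# Extra column that takes two different values inside every group
dtb[, newCol := rep(c("A", "B", "A"), 9)]

# Assign the group mean to every row (modifies dtb by reference)
dtb[, value := mean(value), by = list(month, fac)]

# Compares all columns: rows that differ only in newCol are kept
all_cols <- unique(dtb)

# Compares only the grouping columns: first row of each group is kept
key_cols <- unique(dtb, by = c("month", "fac"))

nrow(all_cols)
nrow(key_cols)
```

With `by`, the remaining columns simply take whatever value sits in the first row of each group, so this only suits columns that are constant within a group or that you do not need.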

The following example may help illustrate.

(You will need to copy and paste it, since the output is omitted.)

# SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm(27)))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore,from sample data.


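  # assign the group mean back into the same table (modifies dtb by reference)
  dtb[, value := mean(value), by = list(month, fac)]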
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore,from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is column name
  dtb[, mean(value), by = list(month, fac)]
  dtb[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value) / 3)]


  dtb1 <- copy(dtb.bak)  # restore,from sample data.
  dtb2 <- copy(dtb.bak)  # restore,from sample data.


  # Method 1
  dtb1[, value := mean(value), by = list(month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean(value)), by = list(month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean(value), newCol), by = list(month, fac)]  # quote marks added for clarity
  # notice this has more rows than
  unique(dtb1)
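
Applied to the original question, Method 2 can be sketched as follows: compute in `j` without `:=` and store the returned table under a new name. This is one possible form rather than the only one. `.()` is data.table's alias for `list()`, and naming the result `value` inside it is an illustrative choice.

```r
library(data.table)

set.seed(42)  # assumption: any seed works
dtb <- data.table(cbind(expand.grid(month = rep(month.abb[1:3], each = 3),
                                    fac = letters[1:3]),
                        value = rnorm(27)))

# One row per (month, fac) combination; dtb itself is left untouched
agg2 <- dtb[, .(value = mean(value)), by = .(month, fac)]

# Spot check one group against a direct calculation
jan_a_direct <- mean(dtb[month == "Jan" & fac == "a", value])
jan_a_agg    <- agg2[month == "Jan" & fac == "a", value]

agg2
```

Because nothing is assigned by reference here, `dtb` still holds the 27 original values after the call.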

