Before installing HPL, the system must already have a compiler, an MPI parallel environment, and one of the Basic Linear Algebra Subprograms (BLAS) library or the Vector Signal Image Processing Library (VSIPL).
The compiler must support C and Fortran 77. MPICH is the usual choice of MPI, though other implementations such as LAM/MPI also work. HPL requires a BLAS or VSIPL library, and the library's performance strongly affects the final measured Linpack result. Common BLAS libraries include GotoBLAS, ATLAS, ACML, ESSL, and MKL.
For the MPI environment I installed Infi-MPI, and for the BLAS library I chose GotoBLAS. The HPL package hpl.tar.gz is downloaded from [url]www.netlib.org/benchmark/hpl[/url]; at the time of writing the latest HPL version is 2.0.
Work as the root user.
The detailed steps are as follows:
I. Installing GotoBLAS
Download GotoBLAS-1.15.tar.gz.
1. cp GotoBLAS-1.15.tar.gz /usr/local/share/
cd /usr/local/share/
tar xzvf GotoBLAS-1.15.tar.gz
cd GotoBLAS
2. If the machine is 32-bit, run:
./quickbuild.32bit
If it is 64-bit, run:
./quickbuild.64bit
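The choice between the two scripts can also be derived from `uname -m` instead of deciding by hand. A minimal sketch, assuming the GotoBLAS 1.x script names `quickbuild.32bit` and `quickbuild.64bit`:

```shell
#!/bin/sh
# Map the machine type reported by uname -m to the matching
# GotoBLAS quickbuild script (script names assumed from GotoBLAS 1.x).
pick_quickbuild() {
    case "$1" in
        x86_64|ia64|alpha|ppc64|sparc64) echo "./quickbuild.64bit" ;;
        *)                               echo "./quickbuild.32bit" ;;
    esac
}

# On the build host:
pick_quickbuild "$(uname -m)"
```

This only covers the common machine strings; on an unrecognized platform it falls back to the 32-bit script, which is the conservative choice.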
3. Edit Makefile.rule (the full file is shown below), and change the architecture setting in getarch.c to match your machine, i.e. select the configuration corresponding to your own hardware.
Makefile.rule
#
# Beginning of user configuration
#
# This library's version
REVISION = -r1.26
# Which C compiler do you prefer? Default is gcc.
C_COMPILER = GNU
# C_COMPILER = INTEL
# C_COMPILER = PGI
# Now you don't need a Fortran compiler to build the library.
# If you don't specify a Fortran compiler, a GNU g77 compatible
# interface will be used.
# F_COMPILER = G77
# F_COMPILER = G95
# F_COMPILER = GFORTRAN
F_COMPILER = INTEL
# F_COMPILER = PGI
# F_COMPILER = PATHSCALE
# F_COMPILER = IBM
# F_COMPILER = COMPAQ
# F_COMPILER = SUN
# F_COMPILER = F2C
# If you need a 64bit binary; some architectures can accept both 32bit
# and 64bit binaries (X86_64, SPARC, Power/PowerPC or Windows).
#BINARY64 = 1
# If you want to build threaded BLAS
SMP = 1
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by script.
MAX_THREADS = 16
# If you want to use the legacy threaded Level 3 implementation.
# Some architectures prefer this algorithm, but that is rare.
# USE_SIMPLE_THREADED_LEVEL3 = 1
# If you want to use GotoBLAS with an accelerator like Cell or GPGPU.
# This is experimental and currently won't work well.
# USE_ACCERELATOR = 1
# Define accelerator type (won't work)
# USE_CELL_SPU = 1
# Threads keep running for a while after a BLAS operation finishes,
# to reduce thread activate/deactivate overhead. You can tune this
# time-out to improve performance. The number should be from 4 to 30
# and corresponds to (1 << n) cycles. For example, if you set it to 26,
# a thread will keep running for (1 << 26) cycles (about 25ms on a 3.0GHz
# system). You can also control this number via GOTO_THREAD_TIMEOUT.
# CCOMMON_OPT += -DTHREAD_TIMEOUT=26
# If you need cross compiling
# (you have to set architecture manually in getarch.c!)
# Example : HOST ... G5 OSX, TARGET = CORE2 OSX
# CROSS_SUFFIX = i686-apple-darwin8-
# CROSS_VERSION = -4.0.1
# CROSS_BINUTILS =
# If you need Special memory management;
# Using HugeTLB file system(Linux / AIX / Solaris)
# HUGETLB_ALLOCATION = 1
# Using bigphysarea memory instead of normal allocation to get
# physically contiguous memory.
# BIGPHYSAREA_ALLOCATION = 1
# To get maximum performance with minimum impact on the system,
# mixing memory allocation may be worth trying. In this case,
# you have to define one of ALLOC_HUGETLB or BIGPHYSAREA_ALLOCATION.
# The remaining allocations will be done by mmap or static allocation.
# (Not implemented yet)
# MIXED_MEMORY_ALLOCATION = 1
# Using static allocation instead of dynamic allocation
# You can't use it with ALLOC_HUGETLB
STATIC_ALLOCATION = 1
# If you want to use CPU affinity
# CCOMMON_OPT += -DUSE_CPU_AFFINITY
# If you want to use memory affinity (NUMA)
# You can't use it with ALLOC_STATIC
# NUMA_AFFINITY = 1
# If you want to use interleaved memory allocation.
# Default is local allocation(it only works with NUMA_AFFINITY).
# CCOMMON_OPT += -DINTERLEAVED_MAPPING
# If you want to drive the whole 64bit region by BLAS. Not all Fortran
# compilers support this. It's safe to keep it commented out if you
# are not sure.
# INTERFACE64 = 1
# If you have special compiler to run script to determine architecture.
GETARCH_CC +=
GETARCH_FLAGS +=
#
# End of user configuration
#
ifdef BINARY32
BINARY64 =
endif
ifndef GOTOBLAS_MAKEFILE
export GOTOBLAS_MAKEFILE = 1
MACHINE =
OSNAME =
PGCPATH =
ARCH =
SUBARCH =
ARCHSUBDIR =
CONFIG =
FU =
LIBSUBARCH =
CORE =
endif
ifndef MACHINE
MACHINE := $(shell uname -m | sed -e s/i.86/i386/)
endif
ifndef OSNAME
OSNAME := $(shell uname -s | sed -e 's/-.*//')
endif
ifneq ($(OSNAME), Darwin)
ifneq ($(OSNAME), CYGWIN_NT)
ifeq ($(MACHINE), i386)
BINARY64 =
NATIVEARCH = YES
endif
endif
endif
ifeq ($(MACHINE), ia64)
BINARY64 = YES
NATIVEARCH = YES
endif
ifeq ($(MACHINE), alpha)
BINARY64 = YES
NATIVEARCH = YES
endif
ifeq ($(OSNAME), AIX)
NATIVEARCH = YES
GETARCH_FLAGS += -maix64
endif
ifeq ($(OSNAME), Darwin)
ifndef BINARY64
NATIVEARCH = YES
endif
EXTRALIB += -lSystemStubs
endif
# If you need to access over 4GB chunk on 64bit system.
ifdef BINARY64
CCOMMON_OPT += -D__64BIT__
GETARCH_FLAGS += -D__64BIT__
ifdef INTERFACE64
CCOMMON_OPT += -DUSE64BITINT
endif
endif